Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
After analyzing the data following reproducible research techniques, we found extensive overlap between the main weather sources of casualties (injuries and fatalities) and of monetary damage (to both crops and property). In both cases, the main phenomena to blame are flash floods, tornadoes, excessive heat, thunderstorm winds, and hail.
The base of the analysis was the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database comes in the form of a comma-separated value file compressed via the bzip2 algorithm to reduce its size, and can be downloaded from the following source:
National Weather Service Storm Data Documentation
The brief for data analysis included the following steps.
Several R libraries were used in preparation for the analysis.
library(R.utils)
## Warning: package 'R.utils' was built under R version 3.2.5
library(dplyr)
library(xtable)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.4
library(reshape)
The libraries include utilities for file management, HTML output, and plotting.
To download the data from the source, the following procedure was used. Note that all data manipulation happens in code, without any manual transformation.
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
bz2File <- "repdata_data_StormData.csv.bz2"
studyFile <- "StormData.csv"
download.file(URL, bz2File, mode = "wb")
bunzip2(bz2File, studyFile, remove = FALSE)
stormData <- read.csv("StormData.csv", as.is = TRUE, comment.char = "")
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
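As an aside to the loading step above, base R can read a bzip2-compressed CSV through a bzfile() connection, so the explicit decompression is optional. The following is a minimal sketch on a small made-up data frame, not the storm file itself; the names toy, bz2Path, and toyRead are illustrative only.

```r
# Toy data standing in for the storm file (values are made up for illustration)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  INJURIES = c(15, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bzip2-compressed CSV in a temporary location
bz2Path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz2Path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection; read.csv() opens and closes it
toyRead <- read.csv(bzfile(bz2Path), as.is = TRUE)
print(toyRead)
```

Keeping the decompressed copy, as the report does, is still useful when the file needs to be inspected with other tools.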
Two major changes were made to the data. The first was to convert the dates in the BGN_DATE column from character strings to an R date-time class, storing the result in a new DATE variable so as not to disrupt the original data structure. The second was to use the DATE variable to keep only records from January 1st, 2000 onward. This restricts the analysis to the last eleven years of weather phenomena and gives a much clearer picture of recent weather patterns and their disruption of both economic activity and human life.
stormData$DATE <- as.POSIXct(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
stormData <- filter(stormData, DATE >= "2000/01/01")
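The parse-then-filter step above can be checked on a handful of strings in the same format as BGN_DATE. This is a minimal sketch; fixing the time zone to UTC and building the cutoff as an explicit date-time are assumptions made here so the comparison does not depend on the local time zone or on implicit string conversion.

```r
# A few date strings in the same layout as BGN_DATE
bgn <- c("4/18/1950 0:00:00", "12/31/1999 0:00:00",
         "1/1/2000 0:00:00", "6/8/2011 0:00:00")

# Parse with an explicit format and time zone (UTC is an assumption)
dates <- as.POSIXct(bgn, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Build the cutoff as a date-time instead of relying on string coercion
cutoff <- as.POSIXct("2000-01-01", tz = "UTC")

# TRUE for records on or after the cutoff
keep <- dates >= cutoff
print(data.frame(bgn, keep))
```

Only the last two strings pass the filter, which matches the intent of keeping records from the year 2000 onward.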
For the purpose of the analysis, several constants were calculated. All of them are totals used later to compute percentages and trends in the data.
## Counter Variables Used in Later Calculations
totalFatalities <- sum(stormData$FATALITIES)
totalInjuries <- sum(stormData$INJURIES)
totalPropDmg <- sum(stormData$PROPDMG)
totalCropDmg <- sum(stormData$CROPDMG)
totalDmg <- (sum(stormData$PROPDMG) + sum(stormData$CROPDMG))
This is all the data wrangling the analysis required; the questions are rather straightforward to answer with the statistics obtained on injuries, fatalities, and types of damage from storms and other weather phenomena.
To determine which weather phenomena have the most negative impact on human welfare, the data set was grouped by weather phenomenon and then ranked by the tally of injuries and fatalities. Injuries were our starting point. The variable INJURIES records the number of injuries caused by a particular weather event (identified by the variable EVTYPE) at a given point in time (identified by the transformed variable DATE). By grouping and summing the records, we can analyze the top ten sources of injuries, as described by the following code:
## Extract most injuries by event type since year 2000
injuries <- group_by(stormData, EVTYPE) %>%
summarize(sumInjuries = sum(INJURIES),
percInjuries = sum(INJURIES) / totalInjuries) %>%
arrange(desc(sumInjuries))
head(injuries, 10)
## Source: local data frame [10 x 3]
##
## EVTYPE sumInjuries percInjuries
## (chr) (dbl) (dbl)
## 1 TORNADO 15213 0.43301170
## 2 EXCESSIVE HEAT 3708 0.10554180
## 3 LIGHTNING 2993 0.08519056
## 4 TSTM WIND 1753 0.04989611
## 5 THUNDERSTORM WIND 1400 0.03984858
## 6 HURRICANE/TYPHOON 1275 0.03629067
## 7 HEAT 1222 0.03478211
## 8 WILDFIRE 911 0.02593004
## 9 FLASH FLOOD 812 0.02311217
## 10 HIGH WIND 677 0.01926963
Some preliminary observations from the previous table.
Injuries are a key classifier for weather phenomena. The impact on people's lives (in this case not one or hundreds but thousands) is pivotal for assessing contingency plans and proper budgets for future weather events. More serious than injuries, however, are the fatalities that result from acute weather patterns.
The tally of fatalities caused by extreme weather follows a procedure almost identical to the one used for injuries. The variable FATALITIES records the number of fatalities caused by a particular weather event (identified by the variable EVTYPE) at a given point in time (identified by the transformed variable DATE). By grouping and summing the records, we can analyze the top ten sources of fatalities, as described by the following code:
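The injuries ranking above follows a group, sum, and rank pattern. As a minimal sketch of that pattern in base R, assuming a small made-up data frame rather than the storm data (the toy values and variable names are illustrative only):

```r
# Made-up records: one row per event, as in the storm data
toy <- data.frame(EVTYPE = c("TORNADO", "HEAT", "TORNADO", "LIGHTNING", "HEAT"),
                  INJURIES = c(15, 2, 6, 1, 0),
                  stringsAsFactors = FALSE)

# Group by event type and sum the injuries within each group
sums <- vapply(split(toy$INJURIES, toy$EVTYPE), sum, numeric(1))

# Rank from most to fewest injuries
ranked <- sort(sums, decreasing = TRUE)

# Share of all injuries attributable to each event type
share <- ranked / sum(toy$INJURIES)
print(cbind(ranked, share))
```

The shares add up to one, which is a quick sanity check on any percentage column built this way.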
## Extract most fatalities in event type since year 2000
fatalities <- group_by(stormData, EVTYPE) %>%
summarize(sumFatalities = sum(FATALITIES),
percFatalities = sum(FATALITIES)/totalFatalities) %>%
arrange(desc(sumFatalities))
head(fatalities, 10)
## Source: local data frame [10 x 3]
##
## EVTYPE sumFatalities percFatalities
## (chr) (dbl) (dbl)
## 1 TORNADO 1193 0.19903237
## 2 EXCESSIVE HEAT 1013 0.16900234
## 3 FLASH FLOOD 600 0.10010010
## 4 LIGHTNING 466 0.07774441
## 5 RIP CURRENT 340 0.05672339
## 6 FLOOD 266 0.04437771
## 7 HEAT 231 0.03853854
## 8 AVALANCHE 179 0.02986320
## 9 HIGH WIND 131 0.02185519
## 10 THUNDERSTORM WIND 130 0.02168836
Not surprisingly, many of the weather phenomena repeat themselves in the table above. We can review that:
There is a lot of overlap among the sources of fatalities and injuries. Tables make for clear definitions, but a plot comparing both classifications is probably a more visual way to understand the importance and weight of each phenomenon. Bar plots were created for this purpose.
## Plot to compare effects of events on fatalities and injuries
plot1 <- ggplot(data = fatalities[1:10,], aes(x = factor(EVTYPE), y = sumFatalities)) +
geom_bar(stat = "identity") + coord_flip() +
scale_x_discrete(limits = fatalities$EVTYPE[10:1]) +
xlab("Type of Event") + ylab("Sum of Fatalities") +
ggtitle("Ranking of Fatalities by Event Type (2000-Present)") +
scale_fill_brewer(palette = "Greys") +
theme(axis.text=element_text(size=8), axis.title=element_text(size=11,face="bold"))
plot2 <- ggplot(data = injuries[1:10,], aes(x = factor(EVTYPE), y = sumInjuries)) +
geom_bar(stat = "identity") + coord_flip() +
scale_x_discrete(limits = injuries$EVTYPE[10:1]) +
xlab("Type of Event") + ylab("Sum of Injuries") +
ggtitle("Ranking of Injuries by Event Type (2000-Present)") +
scale_fill_brewer(palette = "Greys") +
theme(axis.text=element_text(size=8), axis.title=element_text(size=11,face="bold"))
grid.arrange(plot1, plot2, nrow = 2)
The scope of this work is to analyze, in a reproducible manner, the major sources of fatalities and injuries from weather events without necessarily giving any recommendations on the matter. However, we thought it interesting to offer a way to categorize weather events using both injuries and fatalities as weight categories. We call this classification casualties. It combines the injuries and fatalities recorded for each weather phenomenon, ranked by a) injuries first, and b) fatalities second.
## Extract the most fatalities and injuries by event since year 2000
casualties <- group_by(stormData, EVTYPE) %>%
summarize(sumInjuries = sum(INJURIES),
sumFatalities = sum(FATALITIES),
percInjuries = sum(INJURIES) / totalInjuries,
percFatalities = sum(FATALITIES) / totalFatalities) %>%
arrange(desc(sumInjuries), desc(sumFatalities))
head(casualties, 10)
## Source: local data frame [10 x 5]
##
## EVTYPE sumInjuries sumFatalities percInjuries percFatalities
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 TORNADO 15213 1193 0.43301170 0.19903237
## 2 EXCESSIVE HEAT 3708 1013 0.10554180 0.16900234
## 3 LIGHTNING 2993 466 0.08519056 0.07774441
## 4 TSTM WIND 1753 116 0.04989611 0.01935269
## 5 THUNDERSTORM WIND 1400 130 0.03984858 0.02168836
## 6 HURRICANE/TYPHOON 1275 64 0.03629067 0.01067734
## 7 HEAT 1222 231 0.03478211 0.03853854
## 8 WILDFIRE 911 75 0.02593004 0.01251251
## 9 FLASH FLOOD 812 600 0.02311217 0.10010010
## 10 HIGH WIND 677 131 0.01926963 0.02185519
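Ranking by injuries first and fatalities second is a two-key sort, where the second key only matters when the first is tied. A minimal sketch in base R with order(), using made-up rows that include a tie on injuries (the event labels A, B, and C are hypothetical):

```r
# Made-up summary rows; A and B tie on injuries
toy <- data.frame(EVTYPE = c("A", "B", "C"),
                  sumInjuries = c(10, 10, 50),
                  sumFatalities = c(1, 7, 0),
                  stringsAsFactors = FALSE)

# Sort descending on injuries, then descending on fatalities to break ties
ranked <- toy[order(-toy$sumInjuries, -toy$sumFatalities), ]
print(ranked)
```

C comes first on injuries alone, and B is placed ahead of A because it has more fatalities. In the top ten shown above no two event types share an injury count, so the second key does not change that ordering.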
This is perhaps a more opinionated way to prioritize effects of weather phenomena, but one which we feel helps those in charge of making contingency plans and forecasting resources accordingly. It is also easier to visualize ranking of weather effects like so:
## Build a plot of the top 10 casualty types since year 2000, comparing injuries and fatalities side by side
subset <- data.frame(casualties$EVTYPE, casualties$sumInjuries, casualties$sumFatalities)
colnames(subset) <- c("TYPE", "INJURIES", "FATALITIES")
subset <- melt(subset[1:10,], id = c("TYPE"))
ggplot(subset, aes(factor(TYPE), value, fill = variable)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_discrete(limits = subset$TYPE) +
scale_fill_brewer(palette = "Paired") +
xlab("Event Types") + ylab("Casualties") + ggtitle("Sources of Casualties by Event Type (2000-Present)") +
theme(axis.text=element_text(size=10), axis.title=element_text(size=12,face="bold"))
From the above plot it’s clear that resources and contingency plans against weather events should be concentrated on tornadoes, excessive heat, flash floods, and thunderstorm winds.
The effect of weather events on human life can be devastating. But the after-effects of damage to property and crops are also a factor to take into consideration. Even when not fatal, such damage can have a devastating effect on communities and government bodies, which have to deal with the financial consequences.
The NOAA database provides damage figures for both property and crops. Because both variables are economic in nature and more directly comparable than injuries and fatalities, the analysis is simpler. We query the total losses from crop damage (given by the variable CROPDMG) and property damage (given by the variable PROPDMG), tallied by weather event type (using the variable EVTYPE for grouping). We also build a new variable, totalDmg, for total damage, summing the latter two, again grouped by event type. The ranking uses three variables: total damage first (totalDmg), property damage second (sumPropDmg), and crop damage for tie breaking (sumCropDmg).
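The casualty table lists TSTM WIND and THUNDERSTORM WIND as separate rows, although both labels appear to describe the same kind of event. A minimal sketch of merging such labels before ranking, using three injury counts taken from that table; the normalizeEvtype helper is hypothetical and the single TSTM rule is an assumption, since the full data set may need more rules:

```r
# Injury counts copied from the casualties table above
casualtyRows <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "LIGHTNING"),
  sumInjuries = c(1753, 1400, 2993),
  stringsAsFactors = FALSE)

# Hypothetical helper: tidy the label and expand the TSTM abbreviation
normalizeEvtype <- function(x) {
  x <- toupper(trimws(x))
  sub("^TSTM", "THUNDERSTORM", x)
}

# Apply the helper, then re-sum by the cleaned label
casualtyRows$EVTYPE <- normalizeEvtype(casualtyRows$EVTYPE)
merged <- aggregate(sumInjuries ~ EVTYPE, data = casualtyRows, FUN = sum)
merged <- merged[order(-merged$sumInjuries), ]
print(merged)
```

With the two labels combined, thunderstorm wind totals 3153 injuries and moves ahead of lightning (2993), so label cleanup can change the ranking.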
## What type of events cause the most damage since the year 2000
damages <- group_by(stormData, EVTYPE) %>%
summarize(totalDmg = sum(PROPDMG) + sum(CROPDMG),
sumPropDmg = sum(PROPDMG),
sumCropDmg = sum(CROPDMG),
percDmg = (sum(PROPDMG) + sum(CROPDMG)) / (totalPropDmg + totalCropDmg),
percPropDmg = sum(PROPDMG) / totalPropDmg,
percCropDmg = sum(CROPDMG) / totalCropDmg) %>%
arrange(desc(totalDmg), desc(sumPropDmg), desc(sumCropDmg))
head(damages, 10)
## Source: local data frame [10 x 7]
##
## EVTYPE totalDmg sumPropDmg sumCropDmg percDmg
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 FLASH FLOOD 1131715.05 999333.42 132381.63 0.16492912
## 2 TORNADO 980746.61 907111.70 73634.91 0.14292792
## 3 THUNDERSTORM WIND 928920.36 862257.36 66663.00 0.13537509
## 4 TSTM WIND 865286.92 811528.22 53758.70 0.12610154
## 5 HAIL 815812.65 452533.47 363279.18 0.11889147
## 6 FLOOD 792567.18 671747.56 120819.62 0.11550382
## 7 LIGHTNING 397297.29 395884.69 1412.60 0.05789964
## 8 HIGH WIND 259038.15 247108.53 11929.62 0.03775061
## 9 WINTER STORM 97746.93 97093.93 653.00 0.01424503
## 10 WILDFIRE 87371.54 83007.34 4364.20 0.01273299
## Variables not shown: percPropDmg (dbl), percCropDmg (dbl)
The table includes not only total damage to crops and property by event type, but also a percentage indicator to help prioritize events. Again we see overlap with the types of events that dominated injuries and fatalities. Flash floods are the number one cause of economic damage to property and crops, followed closely by tornadoes. Thunderstorm winds occupy third and fourth place (again, this may just be a nomenclature discrepancy in the NOAA database), while hail ranks fifth and flood sixth.
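One caveat on the figures above: the str() listing shows that PROPDMG and CROPDMG are accompanied by PROPDMGEXP and CROPDMGEXP columns holding codes such as "K", and the sums in the table do not apply them. A minimal sketch of scaling the amounts, assuming K, M, and B stand for thousands, millions, and billions and that any other code leaves the value unscaled (both are assumptions to check against the Storm Data documentation); the expToMultiplier helper and the toy rows are hypothetical:

```r
# Hypothetical helper: map an exponent code to a numeric multiplier
expToMultiplier <- function(exp) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)   # assumed meaning of the codes
  mult <- unname(lookup[toupper(exp)])
  mult[is.na(mult)] <- 1                   # assumption: other codes mean no scaling
  mult
}

# Made-up rows in the shape of the PROPDMG / PROPDMGEXP columns
toy <- data.frame(PROPDMG = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# Scale each magnitude by its multiplier
toy$propUSD <- toy$PROPDMG * expToMultiplier(toy$PROPDMGEXP)
print(toy)
```

Because a single entry coded in billions can outweigh thousands of entries coded in thousands, applying the multipliers before summing could reorder the damage ranking.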
A plot bearing total damage by source of event makes the analysis easy to grasp. For purposes of easy visual understanding, we subset the data using just three variables (event type, property damage, and crop damage) to accelerate and simplify the plot.
## Build a plot of the top 10 damage types since year 2000, comparing property & crop damage
subset2 <- data.frame(damages$EVTYPE, damages$sumPropDmg, damages$sumCropDmg)
colnames(subset2) <- c("TYPE", "PROPERTY", "CROP")
subset2 <- melt(subset2[1:10,], id = c("TYPE"))
ggplot(subset2, aes(factor(TYPE), value, fill = variable)) +
geom_bar(stat="identity", position = "stack") +
scale_fill_brewer(palette = "Paired") + scale_x_discrete(limits = subset2$TYPE) +
xlab("Event Type") + ylab("Monetary Damage USD") + ggtitle("Sources of Economical Damage by Weather Event (2000-Present)") +
theme(axis.text=element_text(size=10), axis.title=element_text(size=12,face="bold"))
The plot only reinforces the fact that the most costly weather events measured from a property and crop damage perspective are:
With these weather events in mind, we deem it much easier to plan ahead for severe weather and to prioritize resource allocation.
– END