Author: Cliff Weaver
Coursera Reproducible Research by Johns Hopkins
Date: August 22, 2015
Many severe weather events can result in fatalities, injuries, and property damage. Using data from the National Oceanic and Atmospheric Administration (NOAA), this research suggests tornadoes have caused more harm than other types of weather events. Floods have caused more economic damage than other weather events.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents:
The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce;
Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.
The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [48MB]: http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
NOTE: For this analysis, the data was downloaded on August 16, 2015.
Documentation of the database is available. It describes how some of the variables are constructed and defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years there are fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
There are 37 variables included in this dataset. Only a subset of the variables is used for this analysis: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. BGN_DATE is the date of the weather event, EVTYPE defines the type of storm event, FATALITIES and INJURIES define the number of fatalities and injuries caused by an event, PROPDMG and PROPDMGEXP define the cost incurred from property damage, and CROPDMG and CROPDMGEXP define the cost incurred from crop damage. A total of 903,297 records are in the dataset.
To tidy the data and prepare it for analysis, we need to load a few R libraries:
library(Hmisc)
library(dplyr)
library(ggplot2)
library(tidyr)
library(lubridate)
The data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm. We download it, extract it, and read it into a new data frame.
if (!file.exists("./data/workingData.csv"))
{
  # Download the compressed raw data into the same folder it is read from
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "./data/rawData.csv.bz2", mode = "wb")
  rawData <- read.csv(bzfile("./data/rawData.csv.bz2"))
  # Preserve the raw data and create a working copy of the dataset
  workingData <- rawData %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
  # Save a copy of the working data
  write.csv(workingData, "./data/workingData.csv")
}
workingData <- read.csv("./data/workingData.csv", stringsAsFactors = FALSE)
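As a minimal sketch of why no separate extraction step is needed, the example below writes a small bzip2-compressed CSV and reads it straight back with read.csv(). The toy data frame and the names toy, toy_path and toy_back are assumptions made for illustration; they are not part of the storm dataset.

```r
# Toy stand-in for the storm file; column names mirror the real data,
# the values are made up for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(1L, 0L),
  stringsAsFactors = FALSE
)

# Write the toy data as a bzip2-compressed CSV in a temporary location.
toy_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression when given the path directly.
toy_back <- read.csv(toy_path, stringsAsFactors = FALSE)
print(toy_back)
```

Passing row.names = FALSE here also avoids the extra X column that appears when a data frame is written with its row names and read back.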
Because weather reporting early in the dataset was inconsistent, we restrict the analysis to weather events later than 1995 (that is, from 1996 onward):
workingData$BGN_DATE <- mdy_hms(workingData$BGN_DATE)
workingData <- workingData %>% filter(year(BGN_DATE) > 1995)
# convert to local data frame. Printing only shows 10 rows and as many columns as can fit on your screen
workingData <- tbl_df(workingData)
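The date filter above can be sketched in base R on a few made-up date strings. This assumes BGN_DATE values follow a month/day/year hour:minute:second layout; the sample strings and the names raw_dates, parsed, event_year and keep are hypothetical.

```r
# Hypothetical date strings in month/day/year h:m:s layout.
raw_dates <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# Parse to date-times; a fixed time zone keeps the result reproducible.
parsed <- as.POSIXct(raw_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Extract the calendar year and flag the events after 1995.
event_year <- as.integer(format(parsed, "%Y"))
keep <- event_year > 1995

print(data.frame(raw_dates, event_year, keep))
```

Note that a strict "greater than 1995" comparison keeps events from 1996 onward and drops everything in 1995 itself.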
Let’s take a look at the data:
glimpse(workingData)
## Observations: 653530
## Variables:
## $ X (int) 248768, 248769, 248770, 248771, 248772, 248773, 248...
## $ BGN_DATE (time) 1996-01-06, 1996-01-11, 1996-01-11, 1996-01-11, 19...
## $ EVTYPE (chr) "WINTER STORM", "TORNADO", "TSTM WIND", "TSTM WIND"...
## $ FATALITIES (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ INJURIES (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PROPDMG (dbl) 380, 100, 3, 5, 2, 0, 400, 12, 8, 12, 0, 75, 2, 0, ...
## $ PROPDMGEXP (chr) "K", "K", "K", "K", "K", NA, "K", "K", "K", "K", NA...
## $ CROPDMG (dbl) 38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CROPDMGEXP (chr) "K", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
The dataset contains letters like K, M, and B representing the monetary values of 1,000, 1,000,000 and 1,000,000,000, respectively.
Let’s switch the letters for numbers so we have numerical data to evaluate. We need to do this for both the property and crop damage fields. We then multiply each damage amount by its multiplier to get a dollar value.
#Property damage
workingData$PROPDMGEXP <- gsub('K', '1000', workingData$PROPDMGEXP)
workingData$PROPDMGEXP <- gsub('M', '1000000', workingData$PROPDMGEXP)
workingData$PROPDMGEXP <- gsub('B', '1000000000', workingData$PROPDMGEXP)
workingData$PROPDMGEXP <- as.numeric(workingData$PROPDMGEXP)
#Multiply the PROPDMG values with their respective PROPDMGEXP values
workingData$PROPDMG <- workingData$PROPDMG * workingData$PROPDMGEXP
#Crop Damage
workingData$CROPDMGEXP <- gsub('K', '1000', workingData$CROPDMGEXP)
workingData$CROPDMGEXP <- gsub('M', '1000000', workingData$CROPDMGEXP)
workingData$CROPDMGEXP <- gsub('B', '1000000000', workingData$CROPDMGEXP)
workingData$CROPDMGEXP <- as.numeric(workingData$CROPDMGEXP)
#Multiply the CROPDMG values with their respective CROPDMGEXP values
workingData$CROPDMG <- workingData$CROPDMG * workingData$CROPDMGEXP
glimpse(workingData)
## Observations: 653530
## Variables:
## $ X (int) 248768, 248769, 248770, 248771, 248772, 248773, 248...
## $ BGN_DATE (time) 1996-01-06, 1996-01-11, 1996-01-11, 1996-01-11, 19...
## $ EVTYPE (chr) "WINTER STORM", "TORNADO", "TSTM WIND", "TSTM WIND"...
## $ FATALITIES (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ INJURIES (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PROPDMG (dbl) 380000, 100000, 3000, 5000, 2000, NA, 400000, 12000...
## $ PROPDMGEXP (dbl) 1000, 1000, 1000, 1000, 1000, NA, 1000, 1000, 1000,...
## $ CROPDMG (dbl) 38000, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CROPDMGEXP (dbl) 1000, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
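One alternative to chained substitutions is a named lookup vector, which maps each letter to its multiplier in a single step. The sketch below uses made-up values, and the names exp_multiplier, toy and PROPDMG_USD are assumptions for illustration only.

```r
# Multiplier for each exponent letter described in the text.
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Made-up damage records, including one with a missing exponent.
toy <- data.frame(
  PROPDMG    = c(380, 2.5, 1.2, 0),
  PROPDMGEXP = c("K", "M", "B", NA),
  stringsAsFactors = FALSE
)

# Indexing the named vector by code returns the multiplier;
# a missing code yields NA rather than a silent wrong value.
multiplier <- unname(exp_multiplier[toy$PROPDMGEXP])
toy$PROPDMG_USD <- toy$PROPDMG * multiplier

print(toy)
```

Because each code is looked up independently, the order of the K, M and B entries does not matter, and the original exponent column is left intact for checking.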
The data is looking pretty good, but it needs to be grouped and summarized before plots can be developed. The code below also creates two new fields, totalDeathInjury and totalDollarLost, which are used a bit later in the analysis. (NAs are disregarded.) Using the dplyr package, we manipulate the dataset:
totals <- workingData %>%
  group_by(EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES),
            INJURIES = sum(INJURIES),
            PROPDMG = sum(PROPDMG, na.rm = TRUE),
            CROPDMG = sum(CROPDMG, na.rm = TRUE)) %>%
  mutate(totalDeathInjury = FATALITIES + INJURIES,
         totalDollarLost = PROPDMG + CROPDMG)
glimpse(totals)
## Observations: 516
## Variables:
## $ EVTYPE (chr) " HIGH SURF ADVISORY", " COASTAL FLOOD", " ...
## $ FATALITIES (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ INJURIES (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PROPDMG (dbl) 200000, 0, 50000, 0, 8100000, 8000, 0, 0, 0, ...
## $ CROPDMG (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 28820000,...
## $ totalDeathInjury (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ totalDollarLost (dbl) 200000, 0, 50000, 0, 8100000, 8000, 0, 0, 0, ...
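To make the role of na.rm = TRUE concrete, here is a small base-R sketch of the same group-and-sum idea on made-up records. It is not the dplyr pipeline used above; the toy values and the name toy_totals are assumptions for illustration.

```r
# Made-up records; one TORNADO row has no property damage estimate.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  FATALITIES = c(2L, 1L, 0L, 3L),
  INJURIES   = c(10L, 0L, 1L, 0L),
  PROPDMG    = c(1000, NA, 5000, 2000),
  stringsAsFactors = FALSE
)

# Sum every measure within each event type, skipping missing values.
toy_totals <- aggregate(
  toy[c("FATALITIES", "INJURIES", "PROPDMG")],
  by = list(EVTYPE = toy$EVTYPE),
  FUN = sum, na.rm = TRUE
)

# Combined harm measure, analogous to totalDeathInjury in the text.
toy_totals$totalDeathInjury <- toy_totals$FATALITIES + toy_totals$INJURIES

print(toy_totals)
```

Without na.rm = TRUE, the TORNADO property total would be NA, because a single missing estimate propagates through the whole sum.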
We now have the data needed to answer the two questions of this analysis: which types of weather events cause the most physical harm, and which cause the most economic damage.
totalsPlot <- totals %>% arrange(desc(totalDeathInjury))
totalsPlot1 <- totalsPlot[1:10,]
totalsPlot1 <- totalsPlot1 %>%
  gather(PhysicalHarm, Injuries, FATALITIES:INJURIES)
ggplot(totalsPlot1, aes(x=reorder(EVTYPE, -totalDeathInjury), y=Injuries, fill=PhysicalHarm, order=desc(PhysicalHarm))) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  ggtitle("Most Harmful Weather Events") +
  xlab("Weather Event") + ylab("Total Deaths and Injuries") +
  theme(legend.position=c(.8,.8)) +
  # Palette reference: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)
  scale_fill_brewer(palette = "Dark2", labels=c("Fatalities", "Injuries")) +
  labs(fill="Physical Harm")
For the purposes of this analysis, harm is defined as the sum of deaths and injuries. As the plot above illustrates, tornadoes have caused more physical harm than other types of weather events. This satisfies the first requirement for this analysis.
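The ranking and reshaping that feeds a stacked bar chart can be sketched in base R as follows. The totals are made up, and the names toy_totals, top, long and Count are assumptions for illustration; the analysis itself uses arrange() and gather() for these steps.

```r
# Made-up per-event totals standing in for the summarised data.
toy_totals <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT", "HAIL"),
  FATALITIES = c(15L, 4L, 20L, 0L),
  INJURIES   = c(200L, 60L, 70L, 7L),
  stringsAsFactors = FALSE
)
toy_totals$totalDeathInjury <- toy_totals$FATALITIES + toy_totals$INJURIES

# Keep the three event types with the largest combined harm.
top <- head(toy_totals[order(-toy_totals$totalDeathInjury), ], 3)

# Reshape wide to long: one row per event type and harm category,
# which is the layout a stacked bar chart expects.
long <- data.frame(
  EVTYPE       = rep(top$EVTYPE, times = 2),
  PhysicalHarm = rep(c("FATALITIES", "INJURIES"), each = nrow(top)),
  Count        = c(top$FATALITIES, top$INJURIES),
  stringsAsFactors = FALSE
)

print(long)
```

The long table carries the same totals as the wide one, so the stacked bars still sum to the combined harm used for the ranking.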
totalsPlot <- totals %>% arrange(desc(totalDollarLost))
totalsPlot2 <- totalsPlot[1:10,]
totalsPlot2 <- totalsPlot2 %>%
  gather(dollarHarm, dollarDamage, PROPDMG:CROPDMG)
ggplot(totalsPlot2, aes(x=reorder(EVTYPE, -totalDollarLost), y=dollarDamage/1000000000, fill=dollarHarm, order=desc(dollarHarm))) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  ggtitle("Most Costly Weather Events") +
  xlab("Weather Event") + ylab("Total Economic Cost - Billions $") +
  theme(legend.position=c(.8,.8)) +
  # Palette reference: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)
  scale_fill_brewer(palette = "Dark2", labels=c("Property", "Crop")) +
  labs(fill="Cost")
Floods have caused more economic damage than other weather events. This answers the second question and completes the analysis.
While this analysis identifies the weather events that have had the most lethal and costly impacts in the U.S. after 1995, it answers only a few of the many questions the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database might address. Some other interesting questions that could be explored are:
This is just a sampling of the interesting questions that could be explored.
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] lubridate_1.3.3 tidyr_0.2.0 dplyr_0.4.2 Hmisc_3.16-0
## [5] ggplot2_1.0.1 Formula_1.2-1 survival_2.38-3 lattice_0.20-31
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.6 formatR_1.2 RColorBrewer_1.1-2
## [4] plyr_1.8.3 tools_3.2.1 rpart_4.1-10
## [7] digest_0.6.8 evaluate_0.7 memoise_0.2.1
## [10] gtable_0.1.2 DBI_0.3.1 parallel_3.2.1
## [13] proto_0.3-10 gridExtra_0.9.1 stringr_1.0.0
## [16] knitr_1.10.5 cluster_2.0.2 nnet_7.3-10
## [19] R6_2.1.0 foreign_0.8-65 rmarkdown_0.7
## [22] latticeExtra_0.6-26 reshape2_1.4.1 magrittr_1.5
## [25] scales_0.2.5 htmltools_0.2.6 MASS_7.3-42
## [28] splines_3.2.1 assertthat_0.1 colorspace_1.2-6
## [31] labeling_0.3 stringi_0.5-5 acepack_1.3-3.3
## [34] lazyeval_0.1.10 munsell_0.4.2