Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
National Climatic Data Cente (NCDC) receives Storm Data from the National Weather Service. The National Weather service receives their information from a variety of sources, which include but are not limited to: county, state and federal emergency management officials, local law enforcement officials, skywarn spotters, NWS damage surveys, newspaper clipping services, the insurance industry and the general public. Due to the amount of time it takes to collect, validate, and enter post Storm Data information, the National Climatic Data Center (NCDC) regularly receives Storm Data from the National Weather Service (NWS) approximately 60-90 days after the end of the data month.
An effort is made to use the best available information but because of time and resource constraints, information from these sources may be unverified by the NWS. Therefore, when using information from Storm Data, customers should be cautious as the NW does not guarantee the accuracy or validity of the information.
The National Weather Service makes a best guess using all available data at the time of the publication The damage amounts are received from a variety of sources, including those listed above in the Dat Sources section. Property and Crop damage should be considered as a broad estimate.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
For this assignment below tools are used :
RStudio: RStudio to publish completed analysis document to RPubs. RStudio is used to edit/write this analysis.
knitr: The knitr package is used in order to compile the R Markdown document and convert it to HTML
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
echo = TRUE # Always make code visible
#options(scipen = 1) # Turn off scientific notations for numbers
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(ggplot2)
library(plyr)
require(gridExtra)
## Loading required package: gridExtra
## Loading required package: grid
First, download the data file if already not downloaded
if (!"stormData.csv.bz2" %in% dir("./data/")) {
print("downloading again")
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "data/stormData.csv.bz2")
bunzip2("data/stormData.csv.bz2", overwrite=T, remove=F)
}
Then, we read the generated csv file. If the data already exists in the working environment, we do not need to load it again. Otherwise, we read the csv file.
if (!"stormData" %in% ls()) {
print("Loading again")
stormData <- read.csv("data/stormData.csv", sep = ",")
}
## [1] "Loading again"
dim(stormData)
## [1] 902297 37
head(stormData, n = 2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14 100 3 0 0
## 2 NA 0 2 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
if (dim(stormData)[2] == 37) {
stormData$year <- as.numeric(format(as.Date(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
}
hist(stormData$year, breaks = 30)
Based on the above histogram, we see that the number of events tracked starts to significantly increase around 1995. So, we use the subset of the data from 1995 to 2011 to get most out of good records.
storm <- stormData[stormData$year >= 1995, ]
dim(storm)
## [1] 681500 38
In this section, we check the number of fatalities and injuries that are caused by the severe weather events. We would like to get the first 15 most severe types of weather events.
sortHelper <- function(fieldName, top = 15, dataset = stormData) {
index <- which(colnames(dataset) == fieldName)
field <- aggregate(dataset[, index], by = list(dataset$EVTYPE), FUN = "sum")
names(field) <- c("EVTYPE", fieldName)
field <- arrange(field, field[, 2], decreasing = T)
field <- head(field, n = top)
field <- within(field, EVTYPE <- factor(x = EVTYPE, levels = field$EVTYPE))
return(field)
}
fatalities <- sortHelper("FATALITIES", dataset = storm)
injuries <- sortHelper("INJURIES", dataset = storm)
We will convert the property damage and crop damage data into comparable numerical forms according to the meaning of units described in the code book (Storm Events). Both PROPDMGEXP and CROPDMGEXP columns record a multiplier for each observation where we have Hundred (H), Thousand (K), Million (M) and Billion (B).
convertHelper <- function(dataset = storm, fieldName, newFieldName) {
totalLen <- dim(dataset)[2]
index <- which(colnames(dataset) == fieldName)
dataset[, index] <- as.character(dataset[, index])
logic <- !is.na(toupper(dataset[, index]))
dataset[logic & toupper(dataset[, index]) == "B", index] <- "9"
dataset[logic & toupper(dataset[, index]) == "M", index] <- "6"
dataset[logic & toupper(dataset[, index]) == "K", index] <- "3"
dataset[logic & toupper(dataset[, index]) == "H", index] <- "2"
dataset[logic & toupper(dataset[, index]) == "", index] <- "0"
dataset[, index] <- as.numeric(dataset[, index])
dataset[is.na(dataset[, index]), index] <- 0
dataset <- cbind(dataset, dataset[, index - 1] * 10^dataset[, index])
names(dataset)[totalLen + 1] <- newFieldName
return(dataset)
}
storm <- convertHelper(storm, "PROPDMGEXP", "propertyDamage")
## Warning in convertHelper(storm, "PROPDMGEXP", "propertyDamage"): NAs
## introduced by coercion
storm <- convertHelper(storm, "CROPDMGEXP", "cropDamage")
## Warning in convertHelper(storm, "CROPDMGEXP", "cropDamage"): NAs
## introduced by coercion
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE"
## [5] "COUNTY" "COUNTYNAME" "STATE" "EVTYPE"
## [9] "BGN_RANGE" "BGN_AZI" "BGN_LOCATI" "END_DATE"
## [13] "END_TIME" "COUNTY_END" "COUNTYENDN" "END_RANGE"
## [17] "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES"
## [25] "PROPDMG" "PROPDMGEXP" "CROPDMG" "CROPDMGEXP"
## [29] "WFO" "STATEOFFIC" "ZONENAMES" "LATITUDE"
## [33] "LONGITUDE" "LATITUDE_E" "LONGITUDE_" "REMARKS"
## [37] "REFNUM" "year" "propertyDamage" "cropDamage"
options(scipen=999)
property <- sortHelper("propertyDamage", dataset = storm)
crop <- sortHelper("cropDamage", dataset = storm)
The effect of events on public health is explained below.
We have two datasets fatalities and injuries to explore.
And the following is a pair of graphs of total fatalities and total injuries affected by these severe weather events.
fatalitiesPlot <- qplot(EVTYPE, data = fatalities, weight = FATALITIES, geom = "bar", binwidth = 1, fill=I("darkorange")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous("Number of Fatalities") +
xlab("Severe Weather Type") +
ggtitle("Total Fatalities \nUSA from 1995 - 2011")
injuriesPlot <- qplot(EVTYPE, data = injuries, weight = INJURIES, geom = "bar", binwidth = 1, fill = I("darkorange4")) +
scale_y_continuous("Number of Injuries") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Severe Weather Type") +
ggtitle("Total Injuries \nUSA from 1995 - 2011")
grid.arrange(fatalitiesPlot, injuriesPlot, ncol=2)
Based on the above histograms, we find that excessive heat and tornado cause most fatalities; tornado causes most injuries in the United States from 1995 to 2011.
As for the impact on economy, we have got two sorted lists below by the amount of money cost by damages.
property
crop
And the following is a pair of graphs of total property damage and total crop damage affected by these severe weather events.
propertyPlot <- qplot(EVTYPE, data = property, weight = propertyDamage, geom = "bar", binwidth = 1, fill=I("goldenrod")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous("Property Damage in US dollars") +
xlab("Severe Weather Type") +
ggtitle("Total Property Damage by\n USA from 1995 - 2011")
cropPlot<- qplot(EVTYPE, data = crop, weight = cropDamage, geom = "bar", binwidth = 1, fill = I("goldenrod4")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous("Crop Damage in US dollars") +
xlab("Severe Weather Type") + ggtitle("Total Crop Damage by \nUSA from 1995 - 2011")
grid.arrange(propertyPlot, cropPlot, ncol=2)
Based on the above histograms, we find that flood and hurricane/typhoon cause most property damage; drought and flood causes most crop damage in the United States from 1995 to 2011.
From these data, we found that excessive heat and tornado are most harmful with respect to population health, while flood, drought, and hurricane/typhoon have the greatest economic consequences.