Student: Herbert Barrientos
Date: 2016-02-26
Course: Reproducible Research - Assignment Nr. 2
Institution: Johns Hopkins University via Coursera
Intended audience: Per assignment instructions, “Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events.”
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This study involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The questions addressed by this analysis are:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Operating Environment
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 tools_3.2.2 htmltools_0.3 stringi_0.5-5
## [5] rmarkdown_0.9.2 knitr_1.12.3 stringr_1.0.0 digest_0.6.8
## [9] evaluate_0.8
Description of the Source Data
Data download URL:
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Data documentation URL:
https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
Data FAQ URL:
https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
Data file name:
noaa1950-2011.bz2
Number of rows:
# NOTE: noaaData is the R variable that holds the data read from the source data file
> nrow(noaaData)
[1] 902297
Data organization:
# NOTE: noaaData is the R variable that holds the data read from the source data file
> str(noaaData)
'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
$ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
$ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
$ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
$ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
$ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Source data variables used in this study:
EVTYPE: event type (e.g., DROUGHT, FLOOD, TORNADO, LIGHTNING)
FATALITIES: number of deceased persons recorded for each event
INJURIES: number of injured persons recorded for each event
PROPDMG: property damage in dollars, expressed as a base number (e.g., 2.5, 3.7)
PROPDMGEXP: property damage multipliers, expressed as letters (e.g., K for thousand, M for million)
CROPDMG: crop damage in dollars, expressed as a base number (e.g., 2.5, 3.7)
CROPDMGEXP: crop damage multipliers, expressed as letters (e.g., K for thousand, M for million)
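To illustrate how a base number and its multiplier letter combine into a dollar amount, here is a minimal sketch on made-up records. The data frame `toy`, the lookup vector `multiplier`, and all values are assumptions for illustration only; they are not taken from the NOAA file.

```r
# Hypothetical toy records (illustrative values, not NOAA data)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2),
  PROPDMGEXP = c("K", "M", "B"),
  stringsAsFactors = FALSE
)

# Named lookup vector: multiplier letter -> numeric factor
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Dollar amount = base number * factor selected by the letter
toy$dollars <- toy$PROPDMG * unname(multiplier[toy$PROPDMGEXP])
print(toy)
```

In this sketch, a PROPDMG of 25 with PROPDMGEXP "K" stands for 25,000 dollars.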
Reading the Source Data
# Import libraries
library(data.table)
library(ggplot2)
# Set the download URL, the data directory name, and the name of the downloaded compressed (bz2) file
dataDownloadURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dataDir <- paste0(getwd(), "/repres/prj02/data")
downloadZipName <- paste0(dataDir, "/noaa1950-2011.bz2")
# Make sure the data directory exists (recursive = TRUE also creates missing parent directories)
if (!dir.exists(dataDir))
  dir.create(dataDir, recursive = TRUE)
# Download the data file, overwriting any existing file. mode = "wb" keeps the compressed file intact on Windows
download.file(dataDownloadURL, destfile = downloadZipName, mode = "wb")
# Read the data file. Note: read.csv() can read compressed bz2 files directly
noaaData <- read.csv(downloadZipName, header = TRUE)
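The note about reading bz2 files directly can be checked on a small scale without downloading anything. The sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back; the file contents and the names `tmp` and `demo` are made up for illustration.

```r
# Write a small bzip2-compressed CSV to a temporary file (hypothetical contents)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "FLOOD,0"), con)
close(con)

# read.csv() detects the compression and decompresses while reading
demo <- read.csv(tmp, stringsAsFactors = FALSE)
print(demo)

# Clean up the temporary file
unlink(tmp)
```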
Source Data Transformations
Step 1: Obtain a summary of deceased and injured persons by event type. List the ten most harmful events
# Question 1. Across the United States, which types of events (as indicated in the EVTYPE variable)
# are most harmful with respect to population health?
#
# With respect to the noaaData dataset, "population health" is interpreted as injuries and fatalities
affectedPeople <- noaaData[,"INJURIES"] + noaaData[,"FATALITIES"]
# Create a summary table with two columns: Category: the event type; x: the number of people affected
# From this table, obtain the 10 most harmful event types
affectedPeopleByEvent <- aggregate(affectedPeople, by=list(Category=noaaData$EVTYPE), FUN=sum)
affectedPeopleByEvent <- affectedPeopleByEvent[order(affectedPeopleByEvent$x, decreasing = TRUE),]
affectedPeopleByEvent <- affectedPeopleByEvent[1:10,]
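The aggregate, sort, and select pattern of Step 1 can be followed on a handful of rows. This is a minimal sketch on invented data: the data frame `toy`, its values, and the names `affected`, `byEvent`, and `top2` are assumptions for illustration.

```r
# Hypothetical toy events (illustrative values, not NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  INJURIES   = c(15, 0, 2, 1, 3),
  FATALITIES = c(1, 2, 0, 3, 0),
  stringsAsFactors = FALSE
)

# People affected per event: injuries plus fatalities
affected <- toy$INJURIES + toy$FATALITIES

# Sum per event type, sort in decreasing order, keep the top two
byEvent <- aggregate(affected, by = list(Category = toy$EVTYPE), FUN = sum)
byEvent <- byEvent[order(byEvent$x, decreasing = TRUE), ]
top2 <- head(byEvent, 2)
print(top2)
```

Using `head(byEvent, 2)` instead of `byEvent[1:2, ]` avoids rows of NA when the table has fewer rows than requested.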
Step 2: Obtain a summary of the property damage, in dollars, by event type
# Question 2. Across the United States, which types of events have the greatest economic consequences?
#
# Document: NATIONAL WEATHER SERVICE INSTRUCTION 10-1605; AUGUST 17, 2007; Operations and Services Performance, NWSPD 10-16; STORM DATA PREPARATION
# Page 12 indicates characters used to represent dollar magnitudes: "K" for thousands, "M" for millions, and "B" for billions.
# Damage to property. Although propDamageMultipliers contains other values, only the ones explained in
# the aforementioned document will be used
propDamageMultipliers <- noaaData$PROPDMGEXP
unique(propDamageMultipliers)
# Transform propDamageMultipliers to numeric representations
propDamageMultipliers <- as.character(propDamageMultipliers)
propDamageMultipliers[toupper(propDamageMultipliers) == "K"] <- "1000"
propDamageMultipliers[toupper(propDamageMultipliers) == "M"] <- "1000000"
propDamageMultipliers[toupper(propDamageMultipliers) == "B"] <- "1000000000"
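# Any other values will be set to zero. All three values are compared as quoted strings: an unquoted
# 1000000000 is converted to "1e+09" for the comparison, which would also zero out the "B" entries
propDamageMultipliers[!(propDamageMultipliers %in% c("1000", "1000000", "1000000000"))] <- "0"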
# Set to numeric type
propDamageMultipliers <- as.numeric(propDamageMultipliers)
# Compute the property damage cost, in dollars, for each recorded event
propDamageByEvent <- noaaData[,"PROPDMG"] * propDamageMultipliers
# Create the summary table of property damage costs per event type
propertyConsequencesByEvent <- aggregate(propDamageByEvent, by=list(Category=noaaData$EVTYPE), FUN=sum)
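The multiplier conversion in this step can also be written as a single lookup. The sketch below is one possible alternative, not the code used to produce this report; the helper name `expToMultiplier` is hypothetical. A lookup avoids comparing character values with numeric literals, a comparison that R resolves by converting the number to text (1000000000 becomes "1e+09").

```r
# Hypothetical helper: map multiplier codes to numeric factors.
# Codes other than K, M, B (in either case) map to 0.
expToMultiplier <- function(exp) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(as.character(exp))])
  m[is.na(m)] <- 0   # unknown or empty codes
  m
}

# Usage on a few illustrative codes
codes <- c("K", "m", "B", "", "?", "5")
print(expToMultiplier(codes))

# The coercion pitfall that the lookup sidesteps
print("1000000000" != 1000000000)   # character vs. numeric comparison
```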
Step 3: Obtain a summary of the crop damage, in dollars, by event type
# Damage to crops. Although cropDamageMultipliers contains other values, only the ones explained in
# the aforementioned document will be used
cropDamageMultipliers <- noaaData$CROPDMGEXP
unique(cropDamageMultipliers)
# Transform cropDamageMultipliers to numeric representations
cropDamageMultipliers <- as.character(cropDamageMultipliers)
cropDamageMultipliers[toupper(cropDamageMultipliers) == "K"] <- "1000"
cropDamageMultipliers[toupper(cropDamageMultipliers) == "M"] <- "1000000"
cropDamageMultipliers[toupper(cropDamageMultipliers) == "B"] <- "1000000000"
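# Any other values will be set to zero. All three values are compared as quoted strings: an unquoted
# 1000000000 is converted to "1e+09" for the comparison, which would also zero out the "B" entries
cropDamageMultipliers[!(cropDamageMultipliers %in% c("1000", "1000000", "1000000000"))] <- "0"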
# Set to numeric type
cropDamageMultipliers <- as.numeric(cropDamageMultipliers)
# Compute the crop damage cost, in dollars, for each recorded event
cropDamageByEvent <- noaaData[,"CROPDMG"] * cropDamageMultipliers
# Create the summary table of crop damage costs per event type
cropConsequencesByEvent <- aggregate(cropDamageByEvent, by=list(Category=noaaData$EVTYPE), FUN=sum)
Step 4: From steps 2 and 3, create a consolidated table for both property and crop damage. List the ten most harmful events
# Create a consolidated table of property + crop consequences by event type
propAndCropConsequencesByEvent <- rbind(propertyConsequencesByEvent, cropConsequencesByEvent)
# Create the consolidated economic consequences table. Then, select the 10 most harmful event types
economicConsequencesByEvent <- aggregate(propAndCropConsequencesByEvent$x, by=list(Category=propAndCropConsequencesByEvent$Category), FUN=sum)
economicConsequencesByEvent <- economicConsequencesByEvent[order(economicConsequencesByEvent$x, decreasing = TRUE),]
economicConsequencesByEvent <- economicConsequencesByEvent[1:10,]
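The consolidation in Step 4 stacks the two summary tables and sums again by event type. A minimal sketch on invented totals shows the effect; the tables `prop` and `crop`, their values, and the names `stacked` and `total` are assumptions for illustration.

```r
# Hypothetical per-event summaries (illustrative dollar amounts)
prop <- data.frame(Category = c("FLOOD", "HAIL", "TORNADO"),
                   x = c(100, 40, 250), stringsAsFactors = FALSE)
crop <- data.frame(Category = c("FLOOD", "HAIL", "TORNADO"),
                   x = c(60, 30, 5), stringsAsFactors = FALSE)

# Stack both tables, then sum the two rows that share an event type
stacked <- rbind(prop, crop)
total <- aggregate(stacked$x, by = list(Category = stacked$Category), FUN = sum)

# Sort so the costliest event type comes first
total <- total[order(total$x, decreasing = TRUE), ]
print(total)
```

Stacking with `rbind()` works even when an event type appears in only one of the two tables, since `aggregate()` then sums a single row for it.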
Creation of Output Plots
# Plot the results for the impact to public health
ggplot(affectedPeopleByEvent, aes(x = reorder(Category, x), y = x)) +
  geom_bar(stat='identity') + coord_flip() +
  labs(title='Types of Events most Harmful for Human Health', y="Affected People", x="Event Type")
# Plot the results for the impact to the economy
ggplot(economicConsequencesByEvent, aes(x = reorder(Category, x), y = x)) +
  geom_bar(stat='identity') + coord_flip() +
  labs(title='Types of Events that Have the Greatest Economic Consequences', y="Dollar Amounts", x="Event Type")
Question 1: Across the United States, which types of events are most harmful with respect to population health?
> affectedPeopleByEvent
Category x
834 TORNADO 96979
130 EXCESSIVE HEAT 8428
856 TSTM WIND 7461
170 FLOOD 7259
464 LIGHTNING 6046
275 HEAT 3037
153 FLASH FLOOD 2755
427 ICE STORM 2064
760 THUNDERSTORM WIND 1621
972 WINTER STORM 1527
Table 1 - Ten most harmful natural events affecting human health
Fig. 1 - Plot of the ten most harmful natural events affecting human health, showing the most harmful at the top
Question 2: Across the United States, which types of events have the greatest economic consequences?
> economicConsequencesByEvent
Category x
834 TORNADO 52052113590
170 FLOOD 27819678250
244 HAIL 16958221170
153 FLASH FLOOD 16562128610
95 DROUGHT 13518672000
402 HURRICANE 8910229010
856 TSTM WIND 5038935790
411 HURRICANE/TYPHOON 4903712800
359 HIGH WIND 4608617560
957 WILDFIRE 4020586800
Table 2 - Ten most harmful natural events affecting the economy
Fig. 2 - Plot of the ten most harmful natural events affecting the economy, showing the most harmful at the top
As shown in Tables 1 and 2, tornadoes are by far the single most harmful natural disaster: since 1950 they have killed or injured nearly 97,000 people and caused economic losses of about 52 billion dollars. Excessive heat is the second most harmful event for population health, and floods are the second largest cause of economic losses.