TITLE: “Storm Data Analysis – Historical Impact on Health and Economy by Weather Event Type”
SYNOPSIS: Data on severe weather events in the United States from 1950 through November 2011 have been collected and reported by the U.S. National Oceanic and Atmospheric Administration (NOAA). The data are publicly available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. This study examines the impact of various types of events on public health and on the economy. For purposes of this study, the public health impact of a fatality has been set at ten times the impact of an injury. Similarly, the economic impact of property damage has been set at five times the impact of crop damage. Results indicate that tornadoes are the single most impactful weather event type on both public health and the economy. Among the event types that exceed the study's thresholds, tornadoes account for 63% of the weighted impact on public health and 32% of the weighted impact on the economy.
To begin the analysis, I clear the R workspace, document system information, set the seed for random number generation, and load R packages likely to be used in the analysis.
rm(list = ls())
setwd("~/Desktop/JHU DS Certif/C5 Repro Research/ReproResProject2/Project2")
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_0.10
## [5] ggplot2_1.0.0 grid_3.1.0 gtable_0.1.2 htmltools_0.2.4
## [9] knitr_1.6 MASS_7.3-33 munsell_0.4.2 plyr_1.8.1
## [13] proto_0.3-10 Rcpp_0.11.1 reshape2_1.4 rmarkdown_0.2.49
## [17] scales_0.2.4 stringr_0.6.2 tools_3.1.0 yaml_2.1.11
set.seed(21247)
library(ggplot2)
require(R.utils)
## Loading required package: R.utils
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v1.32.4 (2014-05-14) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
library(qcc)
## Warning: package 'qcc' was built under R version 3.1.1
## Package 'qcc', version 2.5
## Type 'citation("qcc")' for citing this R package in publications.
I then access the data on the internet, download it to my machine, and read it into R. The versions available to me do not allow a direct download of a .bz2 into R:
# download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "repdata-data-StormData.csv")
d <- read.csv("~/Desktop/JHU DS Certif/C5 Repro Research/ReproResProject2/repdata-data-StormData.csv")
str(d)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
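In many R installations, `read.csv` can read a bzip2-compressed file without a separate decompression step, because the underlying `file()` connection detects the compression when reading. The sketch below illustrates this on a small made-up table written to a temporary file; the names `toy`, `tmp`, and `toy_back` are illustrative only and are not part of the analysis.

```r
# Illustrative only: a tiny stand-in for the storm data.
toy <- data.frame(EVTYPE     = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  INJURIES   = c(14, 0))

# Write the table as a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through file(), which detects bzip2 compression.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

If this behaves the same way for the NOAA file, the archive could be saved under its original `.csv.bz2` name (with `mode = "wb"` in `download.file`) and passed to `read.csv` as is.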
TRANSFORMATION 1: As a subjective weighting, I treat each fatality as ten times as harmful to public health as an injury. I then focus on weather event types whose total HARM = SUM(10 * FATALITIES + INJURIES) across all recorded events exceeds 10,000.
harm2health <- sqldf("SELECT EVTYPE, SUM(10 * FATALITIES + INJURIES) HARM
FROM d
GROUP BY EVTYPE
HAVING HARM > 10000
ORDER BY HARM DESC")
## Loading required package: tcltk
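The same weighting, grouping, and filtering can be sketched in base R without SQL. The example below uses a few made-up records and a small stand-in threshold so that the arithmetic is easy to follow; `toy`, `harm_by_type`, and `threshold` are illustrative names, not objects from the analysis.

```r
# Toy records (made up) with the three columns the query uses.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HAIL"),
  FATALITIES = c(1, 0, 2, 0),
  INJURIES   = c(14, 3, 0, 1)
)

# Per-record weighted harm: a fatality counts ten times an injury.
toy$HARM <- 10 * toy$FATALITIES + toy$INJURIES

# Sum within event type, keep types above a threshold, sort descending.
harm_by_type <- aggregate(HARM ~ EVTYPE, data = toy, FUN = sum)
threshold <- 5   # stands in for the 10,000 cutoff used on the full data
harm_by_type <- harm_by_type[harm_by_type$HARM > threshold, ]
harm_by_type <- harm_by_type[order(-harm_by_type$HARM), ]
print(harm_by_type)
```

Here the two tornado records contribute 24 and 3, the heat record contributes 20, and the hail record (1) falls below the stand-in threshold and is dropped.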
RESULT 1: HARM to public health as a function of major weather event type is then displayed in a standard Pareto chart, and a table is saved that lists the frequencies across all weather event types.
pareto.chart(xtabs(HARM ~ EVTYPE,
data = harm2health, drop.unused.levels = TRUE),
ylab = "Deaths and Injuries", ylab2 = "Cumulative Proportion",
cumperc = seq(0, 100, by = 10),
main = "Deaths and Injuries as a Function of Most Significant Event Types")
##
## Pareto chart analysis for xtabs(HARM ~ EVTYPE, data = harm2health, drop.unused.levels = TRUE)
## Frequency Cum.Freq. Percentage Cum.Percent.
## TORNADO 147676 147676 63.344 63.34
## EXCESSIVE HEAT 25555 173231 10.962 74.31
## LIGHTNING 13390 186621 5.743 80.05
## TSTM WIND 11997 198618 5.146 85.19
## FLASH FLOOD 11557 210175 4.957 90.15
## FLOOD 11489 221664 4.928 95.08
## HEAT 11470 233134 4.920 100.00
harm2health_TABLE <- pareto.chart(xtabs(HARM ~ EVTYPE,
data = harm2health, drop.unused.levels = FALSE),
plot = FALSE)
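The columns of the Pareto table can be reproduced by hand, which makes it clear what the percentages are relative to. The sketch below recomputes them from the HARM totals printed above; judging by the printed output, the percentages are shares of the summed HARM of the event types that passed the threshold, not of all events in the database. The object names `harm` and `pareto` are illustrative.

```r
# Weighted harm totals copied from the Pareto output above.
harm <- c(TORNADO = 147676, "EXCESSIVE HEAT" = 25555, LIGHTNING = 13390,
          "TSTM WIND" = 11997, "FLASH FLOOD" = 11557, FLOOD = 11489,
          HEAT = 11470)

harm <- sort(harm, decreasing = TRUE)   # largest category first
pareto <- data.frame(
  Frequency   = harm,
  Cum.Freq    = cumsum(harm),
  Percentage  = 100 * harm / sum(harm),
  Cum.Percent = 100 * cumsum(harm) / sum(harm)
)
print(round(pareto, 3))
```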
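TRANSFORMATION 2: As a subjective weighting, I treat property damage as five times as costly to the economy as crop damage. I then focus on weather event types whose total COST = SUM(5 * PROPDMG + CROPDMG) across all recorded events exceeds 1,000,000. PROPDMG and CROPDMG are summed as recorded, without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP, so COST is a relative index rather than a dollar amount.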
economic_damage <- sqldf("SELECT EVTYPE, SUM(5 * PROPDMG + CROPDMG) COST
FROM d
GROUP BY EVTYPE
HAVING COST > 1000000
ORDER BY COST DESC")
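The query above sums PROPDMG and CROPDMG as recorded. The data set also carries PROPDMGEXP and CROPDMGEXP columns, which are generally read as magnitude codes. A minimal sketch of scaling the damage figures before weighting them follows. It assumes the mapping K = thousands, M = millions, B = billions and treats every other code, including blanks, as a multiplier of 1. Both the mapping and the helper `exp_multiplier` are assumptions made for illustration, and the records are made up.

```r
# Hypothetical helper: translate a magnitude code into a multiplier.
# Assumed mapping: K = 1e3, M = 1e6, B = 1e9 (case-insensitive);
# any other code, including blank, is treated as 1 in this sketch.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy records (made up) with the four damage columns.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 1, 5),
  CROPDMGEXP = c("", "k", "K")
)

# Scale each damage figure, then apply the 5:1 property-to-crop weighting.
toy$PROP_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$CROP_USD <- toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
toy$COST_USD <- 5 * toy$PROP_USD + toy$CROP_USD
print(toy[, c("EVTYPE", "PROP_USD", "CROP_USD", "COST_USD")])
```

Applied to the full data, a scaling of this kind could change both the totals and the ranking, so the results reported here should be read as unscaled.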
RESULT 2: COST to the economy as a function of major weather event type is then displayed in a standard Pareto chart, and a table is saved that lists the frequencies across all weather event types.
pareto.chart(xtabs(COST ~ EVTYPE,
data = economic_damage, drop.unused.levels = TRUE),
ylab = "Weighted Damage (unscaled PROPDMG and CROPDMG)", ylab2 = "Cumulative Proportion",
cumperc = seq(0, 100, by = 10),
main = "Economic Damage as a Function of Most Significant Event Types")
##
## Pareto chart analysis for xtabs(COST ~ EVTYPE, data = economic_damage, drop.unused.levels = TRUE)
## Frequency Cum.Freq. Percentage Cum.Percent.
## TORNADO 16161309 16161309 32.140 32.14
## FLASH FLOOD 7279823 23441133 14.478 46.62
## TSTM WIND 6789031 30230163 13.502 60.12
## FLOOD 4667730 34897894 9.283 69.40
## THUNDERSTORM WIND 4451012 39348906 8.852 78.25
## HAIL 4023063 43371969 8.001 86.26
## LIGHTNING 3020340 46392309 6.007 92.26
## THUNDERSTORM WINDS 2250151 48642459 4.475 96.74
## HIGH WIND 1640941 50283400 3.263 100.00
economic_damage_TABLE <- pareto.chart(xtabs(COST ~ EVTYPE,
data = economic_damage, drop.unused.levels = FALSE),
plot = FALSE)
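The output above lists TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS as separate event types. The sketch below shows how the printed totals change if those three labels are merged, on the assumption that they describe the same kind of event. It works only from the COST totals printed above, so it cannot account for spelling variants that fell below the threshold. The names `cost`, `label`, and `merged` are illustrative.

```r
# COST totals copied from the Pareto output above.
cost <- c(TORNADO = 16161309, "FLASH FLOOD" = 7279823, "TSTM WIND" = 6789031,
          FLOOD = 4667730, "THUNDERSTORM WIND" = 4451012, HAIL = 4023063,
          LIGHTNING = 3020340, "THUNDERSTORM WINDS" = 2250151,
          "HIGH WIND" = 1640941)

# Assumption: these three labels describe the same kind of event.
label <- names(cost)
label[label %in% c("TSTM WIND", "THUNDERSTORM WINDS")] <- "THUNDERSTORM WIND"

# Re-total by the merged label and recompute the shares.
merged <- sort(tapply(cost, label, sum), decreasing = TRUE)
print(merged)
print(round(100 * merged / sum(merged), 1))
```

Under this assumption the merged thunderstorm wind category moves into second place, while tornadoes remain first.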
Finally, the tables that follow compare the “Top 5” event types that impact public health and the economy, respectively.
head(harm2health_TABLE, 5)
##
## Pareto chart analysis for xtabs(HARM ~ EVTYPE, data = harm2health, drop.unused.levels = FALSE)
## Frequency Cum.Freq. Percentage Cum.Percent.
## TORNADO 147676 147676 63.344 63.34
## EXCESSIVE HEAT 25555 173231 10.962 74.31
## LIGHTNING 13390 186621 5.743 80.05
## TSTM WIND 11997 198618 5.146 85.19
## FLASH FLOOD 11557 210175 4.957 90.15
head(economic_damage_TABLE, 5)
##
## Pareto chart analysis for xtabs(COST ~ EVTYPE, data = economic_damage, drop.unused.levels = FALSE)
## Frequency Cum.Freq. Percentage Cum.Percent.
## TORNADO 16161309 16161309 32.140 32.14
## FLASH FLOOD 7279823 23441133 14.478 46.62
## TSTM WIND 6789031 30230163 13.502 60.12
## FLOOD 4667730 34897894 9.283 69.40
## THUNDERSTORM WIND 4451012 39348906 8.852 78.25
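The relative sizes of the leading event types can be checked directly from the two tables above. A short worked calculation, using only the printed totals (the variable names are illustrative):

```r
# Totals copied from the two "Top 5" tables above.
harm <- c(TORNADO = 147676, "EXCESSIVE HEAT" = 25555)
cost <- c(TORNADO = 16161309, "FLASH FLOOD" = 7279823)

# Ratio of the tornado total to the second-ranked event type in each table.
harm_ratio <- harm[["TORNADO"]] / harm[["EXCESSIVE HEAT"]]
cost_ratio <- cost[["TORNADO"]] / cost[["FLASH FLOOD"]]
print(round(c(harm = harm_ratio, cost = cost_ratio), 2))
```

On these figures, the tornado total is a little under six times the second-ranked total for public health and a little over twice the second-ranked total for the economic measure.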
In conclusion, tornadoes inflict the greatest damage. Among the event types that exceed the study's thresholds, they caused 63% of the weighted harm to public health, more than five times the impact of the next most destructive event type, and more than twice the impact of any other weather event type on the weighted economic measure.