TITLE: “Storm Data Analysis – Historical Impact on Health and Economy by Weather Event Type”

SYNOPSIS: Data on severe weather events in the United States from 1950 through November 2011 have been collected and reported by the U.S. National Oceanic and Atmospheric Administration (NOAA). The data set is publicly available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. This study examines the impact of various types of events on public health and on the economy. For purposes of this study, the public health impact of a fatality has been set at ten times the impact of an injury. Similarly, the economic impact of property damage has been set at five times the impact of crop damage. Results indicate that tornadoes are the single most impactful weather event type for both public health and the economy. Among the event types that exceed the study's thresholds, tornadoes account for 63% of the weighted impact on public health and 32% of the weighted impact on the economy.

To begin the analysis, I clear the R workspace, document system information, set the seed for random number generation, and load R packages likely to be used in the analysis.

rm(list = ls()) 
setwd("~/Desktop/JHU DS Certif/C5 Repro Research/ReproResProject2/Project2")
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] ggplot2_1.0.0    grid_3.1.0       gtable_0.1.2     htmltools_0.2.4 
##  [9] knitr_1.6        MASS_7.3-33      munsell_0.4.2    plyr_1.8.1      
## [13] proto_0.3-10     Rcpp_0.11.1      reshape2_1.4     rmarkdown_0.2.49
## [17] scales_0.2.4     stringr_0.6.2    tools_3.1.0      yaml_2.1.11
set.seed(21247)
library(ggplot2)
require(R.utils)
## Loading required package: R.utils
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.32.4 (2014-05-14) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
library(qcc)
## Warning: package 'qcc' was built under R version 3.1.1
## Package 'qcc', version 2.5
## Type 'citation("qcc")' for citing this R package in publications.

I then access the data on the internet, download it to my machine, and read it into R. The software versions available to me do not allow a direct download of the .bz2 file into R, so the download call is recorded here only as a comment:

# download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "repdata-data-StormData.csv")

d <- read.csv("~/Desktop/JHU DS Certif/C5 Repro Research/ReproResProject2/repdata-data-StormData.csv")
str(d)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
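
As a hedged aside on the download step: base R's bzfile() connection decompresses bzip2 data on the fly, so read.csv() can in principle read a .csv.bz2 file without a separate unzip step. The sketch below demonstrates this on a small temporary file rather than on the NOAA file. The toy data frame and the temporary file are illustrative assumptions, not part of the original analysis.

```r
# Toy data standing in for the storm data (hypothetical values).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection; read.csv() opens and closes
# the connection itself and decompresses while reading.
back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(back)
```

For the real file, the same pattern would presumably apply after a download made with download.file(..., mode = "wb"), so that the compressed bytes are stored unaltered.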

TRANSFORMATION 1: I have made the subjective judgment that a fatality represents ten times the harm to public health of an injury. I further restrict attention to weather event types whose total HARM = SUM(10 * FATALITIES + INJURIES), taken over all recorded events of that type, exceeds 10,000.

harm2health <- sqldf("SELECT EVTYPE, SUM(10 * FATALITIES + INJURIES) HARM             
                      FROM d             
                      GROUP BY EVTYPE
                      HAVING HARM > 10000
                      ORDER BY HARM DESC")
## Loading required package: tcltk
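
The same group, sum, filter, and sort steps can also be sketched in base R without SQL. This is a minimal illustration on made-up data, not the author's code. The toy data frame and the threshold of 10 are assumptions chosen so that the filter visibly removes one event type.

```r
# Toy events (hypothetical values); column names mirror the storm data.
toy <- data.frame(EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD"),
                  FATALITIES = c(1, 0, 2, 0),
                  INJURIES   = c(15, 2, 0, 3),
                  stringsAsFactors = FALSE)

# Per-event weighted harm: one fatality counts as ten injuries.
toy$HARM <- 10 * toy$FATALITIES + toy$INJURIES

# GROUP BY EVTYPE with SUM(HARM).
harm <- aggregate(HARM ~ EVTYPE, data = toy, FUN = sum)

# HAVING HARM > threshold (hypothetical threshold for the toy data).
threshold <- 10
harm <- harm[harm$HARM > threshold, ]

# ORDER BY HARM DESC.
harm <- harm[order(-harm$HARM), ]
print(harm)
```

With the toy values, TORNADO totals 27, HEAT totals 20, and FLOOD (total 3) falls below the threshold and is dropped.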

RESULT 1: HARM to public health by major weather event type is then displayed in a standard Pareto chart. The corresponding Pareto table is also saved, without plotting and with drop.unused.levels = FALSE, for the “Top 5” comparison at the end.

pareto.chart(xtabs(HARM ~ EVTYPE, 
                      data = harm2health, drop.unused.levels = TRUE), 
                      ylab = "Deaths and Injuries", ylab2 = "Cumulative Proportion", 
                      cumperc = seq(0, 100, by = 10), 
                      main = "Deaths and Injuries as a Function of Most Significant Event Types")

[Figure: Pareto chart titled “Deaths and Injuries as a Function of Most Significant Event Types”]

##                 
## Pareto chart analysis for xtabs(HARM ~ EVTYPE, data = harm2health, drop.unused.levels = TRUE)
##                  Frequency Cum.Freq. Percentage Cum.Percent.
##   TORNADO           147676    147676     63.344        63.34
##   EXCESSIVE HEAT     25555    173231     10.962        74.31
##   LIGHTNING          13390    186621      5.743        80.05
##   TSTM WIND          11997    198618      5.146        85.19
##   FLASH FLOOD        11557    210175      4.957        90.15
##   FLOOD              11489    221664      4.928        95.08
##   HEAT               11470    233134      4.920       100.00
harm2health_TABLE <- pareto.chart(xtabs(HARM ~ EVTYPE, 
                      data = harm2health, drop.unused.levels = FALSE), 
                      plot = FALSE)
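
To make the Percentage and Cum.Percent. columns concrete, the sketch below recomputes them by hand from the Frequency values printed in the table above. It assumes only those printed totals and does not call pareto.chart. Note that each percentage is a share of the total over the listed event types (233,134), not over every event type in the data set.

```r
# Weighted HARM totals copied from the Pareto table above.
harm <- c("TORNADO" = 147676, "EXCESSIVE HEAT" = 25555, "LIGHTNING" = 13390,
          "TSTM WIND" = 11997, "FLASH FLOOD" = 11557, "FLOOD" = 11489,
          "HEAT" = 11470)

# A Pareto table is sorted from largest to smallest contribution.
harm <- sort(harm, decreasing = TRUE)

pareto <- data.frame(
  Frequency   = harm,
  Cum.Freq    = cumsum(harm),
  Percentage  = round(100 * harm / sum(harm), 3),
  Cum.Percent = round(100 * cumsum(harm) / sum(harm), 2)
)
print(pareto)
```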

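TRANSFORMATION 2: I have made the subjective judgment that property damage represents five times the cost to the economy of crop damage. I further restrict attention to weather event types whose total COST = SUM(5 * PROPDMG + CROPDMG), taken over all recorded events of that type, exceeds 1,000,000. PROPDMG and CROPDMG are summed as recorded; the PROPDMGEXP and CROPDMGEXP columns are not used.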

economic_damage <- sqldf("SELECT EVTYPE, SUM(5 * PROPDMG + CROPDMG) COST
                    FROM d             
                    GROUP BY EVTYPE             
                    HAVING COST > 1000000              
                    ORDER BY COST DESC")
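
The query above sums PROPDMG and CROPDMG exactly as recorded. The data set also carries PROPDMGEXP and CROPDMGEXP columns (visible in the str() output), which NOAA uses as magnitude codes such as K, M, and B. The sketch below shows one way such codes might be applied before aggregating. It is not what this analysis does, and the results reported here would change if it were. The helper exp_to_multiplier, the toy data, and the rule that any other code counts as a multiplier of 1 are assumptions made for illustration.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: K = thousands, M = millions, B = billions; any other code,
# including the empty string, is treated as 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy events (hypothetical values); column names mirror the storm data.
toy <- data.frame(EVTYPE     = c("TORNADO", "TORNADO", "FLOOD"),
                  PROPDMG    = c(25, 2.5, 1),
                  PROPDMGEXP = c("K", "M", "B"),
                  CROPDMG    = c(0, 10, 0),
                  CROPDMGEXP = c("", "K", ""),
                  stringsAsFactors = FALSE)

# Scale each damage figure by its magnitude code, then apply the 5:1 weighting.
toy$PROP_USD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy$CROP_USD <- toy$CROPDMG * exp_to_multiplier(toy$CROPDMGEXP)
toy$COST     <- 5 * toy$PROP_USD + toy$CROP_USD

cost <- aggregate(COST ~ EVTYPE, data = toy, FUN = sum)
cost <- cost[order(-cost$COST), ]
print(cost)
```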

RESULT 2: COST to the economy by major weather event type is then displayed in a standard Pareto chart. The corresponding Pareto table is also saved, without plotting and with drop.unused.levels = FALSE, for the “Top 5” comparison at the end.

pareto.chart(xtabs(COST ~ EVTYPE, 
                    data = economic_damage, drop.unused.levels = TRUE), 
                    ylab = "Weighted Damage (5 * PROPDMG + CROPDMG, as recorded)", ylab2 = "Cumulative Proportion", 
                    cumperc = seq(0, 100, by = 10), 
                    main = "Economic Damage as a Function of Most Significant Event Types")

[Figure: Pareto chart titled “Economic Damage as a Function of Most Significant Event Types”]

##                     
## Pareto chart analysis for xtabs(COST ~ EVTYPE, data = economic_damage, drop.unused.levels = TRUE)
##                      Frequency Cum.Freq. Percentage Cum.Percent.
##   TORNADO             16161309  16161309     32.140        32.14
##   FLASH FLOOD          7279823  23441133     14.478        46.62
##   TSTM WIND            6789031  30230163     13.502        60.12
##   FLOOD                4667730  34897894      9.283        69.40
##   THUNDERSTORM WIND    4451012  39348906      8.852        78.25
##   HAIL                 4023063  43371969      8.001        86.26
##   LIGHTNING            3020340  46392309      6.007        92.26
##   THUNDERSTORM WINDS   2250151  48642459      4.475        96.74
##   HIGH WIND            1640941  50283400      3.263       100.00
economic_damage_TABLE <- pareto.chart(xtabs(COST ~ EVTYPE, 
                    data = economic_damage, drop.unused.levels = FALSE), 
                    plot = FALSE)

Finally, the tables that follow compare the “Top 5” event types that impact public health and the economy, respectively.

head(harm2health_TABLE, 5)
##                 
## Pareto chart analysis for xtabs(HARM ~ EVTYPE, data = harm2health, drop.unused.levels = FALSE)
##                  Frequency Cum.Freq. Percentage Cum.Percent.
##   TORNADO           147676    147676     63.344        63.34
##   EXCESSIVE HEAT     25555    173231     10.962        74.31
##   LIGHTNING          13390    186621      5.743        80.05
##   TSTM WIND          11997    198618      5.146        85.19
##   FLASH FLOOD        11557    210175      4.957        90.15
head(economic_damage_TABLE, 5)
##                    
## Pareto chart analysis for xtabs(COST ~ EVTYPE, data = economic_damage, drop.unused.levels = FALSE)
##                     Frequency Cum.Freq. Percentage Cum.Percent.
##   TORNADO            16161309  16161309     32.140        32.14
##   FLASH FLOOD         7279823  23441133     14.478        46.62
##   TSTM WIND           6789031  30230163     13.502        60.12
##   FLOOD               4667730  34897894      9.283        69.40
##   THUNDERSTORM WIND   4451012  39348906      8.852        78.25
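
The gap between the leading event type and the runner-up in each table can be expressed as a simple ratio. The sketch below uses only the Frequency values printed in the two tables above; the variable names are illustrative.

```r
# Top two weighted totals copied from each "Top 5" table above.
health  <- c("TORNADO" = 147676,   "EXCESSIVE HEAT" = 25555)
economy <- c("TORNADO" = 16161309, "FLASH FLOOD"    = 7279823)

# Ratio of the leading event type to the runner-up.
health_ratio  <- unname(health[1] / health[2])
economy_ratio <- unname(economy[1] / economy[2])

cat(sprintf("Health:  tornado total is %.2f times the runner-up\n", health_ratio))
cat(sprintf("Economy: tornado total is %.2f times the runner-up\n", economy_ratio))
```

On these figures, the tornado total is between five and six times the excessive heat total on the health measure, and a little over twice the flash flood total on the economic measure.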

In conclusion, tornadoes inflict the greatest damage on both measures. Among the event types that exceed the study's thresholds, tornadoes account for 63% of the weighted harm to public health, more than five times the harm of the next most destructive event type (excessive heat). They also account for 32% of the weighted economic cost, more than twice that of any other weather event type.