Analysis of Which Storm was the Most Devastating in terms of Population and Cost Impact

The goal of this data analysis is to answer the following questions:

Question 1

Across the United States, which types of events are most harmful with respect to population health?

Question 2

Across the United States, which types of events have the greatest economic consequences?

Data Processing

Software

The following software and platform configuration was used to perform this analysis:

sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.5
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_0.10   stringr_0.6.2  tools_3.1.0

Data Loading

Data for this analysis was sourced from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The data was downloaded and loaded as follows.

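# Download once; mode = "wb" keeps the compressed file intact on Windows.
# Older R builds may need an https-capable method (e.g. method = "libcurl" in R >= 3.2).
if (!file.exists("stormdata.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
        destfile = "stormdata.csv.bz2", mode = "wb")
}
data <- read.csv(bzfile("stormdata.csv.bz2"), header = TRUE, sep = ",")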
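
As a quick check of the reading step, the sketch below writes a tiny made-up CSV through a bzfile connection and reads it back the same way. The temporary file and its three rows are hypothetical stand-ins, not the NOAA data.

```r
# Hypothetical toy rows standing in for the NOAA file
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
                  INJURIES = c(3, 0, 1))

# Write the toy data as a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv decompresses on the fly when given a bzfile connection
roundtrip <- read.csv(bzfile(path), header = TRUE, sep = ",")
print(roundtrip)
```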

The number of attributes in the data set is:

ncol(data)
## [1] 37

The names of the attributes in the data set are:

names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The number of rows in the data set is:

nrow(data)
## [1] 902297

Pre-Processing Data

(1) Keep only the records with at least one nonzero injury, fatality, or damage value, and only the attributes related to population health and economic impact.

data <- subset(x = data, subset = INJURIES > 0 | FATALITIES > 0 | PROPDMG > 
    0 | CROPDMG > 0, select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", 
    "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))

(2) Let's prepare the data to address question 1.

To find which weather event has the most injuries, let's create the injuries variable which adds all injuries in the data$INJURIES column per data$EVTYPE.

injuries <- aggregate(INJURIES ~ EVTYPE, data = data, sum)

To find which weather event has the most fatalities, let's create the fatalities variable which adds all fatalities in the data$FATALITIES column per data$EVTYPE.

fatalities <- aggregate(FATALITIES ~ EVTYPE, data = data, sum)

And to find which weather event is the most harmful, let's create the variable casualties adding data$INJURIES + data$FATALITIES per data$EVTYPE.

data$CASUALTIES <- data$FATALITIES + data$INJURIES
casualties <- aggregate(CASUALTIES ~ EVTYPE, data = data, sum)
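
The three aggregations above follow the same pattern. As a sketch on made-up rows (the `toy` data frame and its counts are hypothetical), `aggregate` can also sum several columns in one call by binding them with `cbind` on the left-hand side of the formula:

```r
# Hypothetical toy records, not taken from the NOAA data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 1, 0, 3, 0),
  INJURIES   = c(10, 0, 5, 2, 1)
)

# Sum both health columns per event type in a single call
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Casualties are the sum of the two per-event totals
health$CASUALTIES <- health$FATALITIES + health$INJURIES

# Rank event types from most to least harmful
health <- health[order(health$CASUALTIES, decreasing = TRUE), ]
print(health)
```

Summing the per-event totals gives the same casualties figure as summing a row-wise CASUALTIES column first, since both are plain sums over the same rows.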

(3) For question 2.

Let's calculate the property (data$PROPDMG) and crop (data$CROPDMG) damage per event. First, note that data$PROPDMGEXP and data$CROPDMGEXP hold the multipliers for the damage amounts: K, M, and B stand for thousands, millions, and billions of US dollars. Any other value (blank, corrupt, or miscoded) is ignored, so that damage term counts as zero.

multiplier <- c(K = 10^3, M = 10^6, B = 10^9)
# Unmatched codes yield NA; zero each term separately so one bad code does not discard the other
prop <- data$PROPDMG * multiplier[as.character(data$PROPDMGEXP)]
crop <- data$CROPDMG * multiplier[as.character(data$CROPDMGEXP)]
data$DMG <- ifelse(is.na(prop), 0, prop) + ifelse(is.na(crop), 0, crop)
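
A small sketch of how the named-vector lookup behaves, using made-up exponent codes: any code other than K, M, or B (including a blank or a lowercase letter) comes back as NA, and NA propagates through arithmetic, so a single NA term makes a whole sum NA unless it is handled explicitly.

```r
multiplier <- c(K = 10^3, M = 10^6, B = 10^9)

# Hypothetical exponent codes: three valid, then a blank, a lowercase letter, a symbol
codes <- c("K", "M", "B", "", "k", "?")
factors <- unname(multiplier[codes])
print(factors)

# A valid property term plus a crop term with a blank code gives NA
prop_term <- 25 * multiplier[["K"]]
crop_term <- 0 * unname(multiplier[""])
print(prop_term + crop_term)

# Replacing NA with 0 before summing keeps the valid term
total <- prop_term + ifelse(is.na(crop_term), 0, crop_term)
print(total)
```

This matters for aggregation because the formula interface of `aggregate` drops rows with NA values by default.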

To find which weather event has the most expensive damages, let's create the damages variable which adds all damages in US dollars (data$DMG column) per data$EVTYPE.

damages <- aggregate(DMG ~ EVTYPE, data = data, sum)

Results

Top 10 weather events with the most injuries:

injuries[order(injuries$INJURIES, decreasing = T), ][1:10, ]
##                EVTYPE INJURIES
## 407           TORNADO    91346
## 423         TSTM WIND     6957
## 86              FLOOD     6789
## 61     EXCESSIVE HEAT     6525
## 258         LIGHTNING     5230
## 151              HEAT     2100
## 238         ICE STORM     1975
## 73        FLASH FLOOD     1777
## 364 THUNDERSTORM WIND     1488
## 134              HAIL     1361
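
The table lists TSTM WIND and THUNDERSTORM WIND as separate event types, which suggests that some EVTYPE labels are spelling variants of the same event. As a hedged sketch of one possible cleanup, labels could be normalized before aggregation. The `normalize_evtype` helper and its single synonym rule are hypothetical, and the toy counts are made up.

```r
# Hypothetical helper: uppercase, trim, collapse spaces, and map one known variant
normalize_evtype <- function(x) {
  x <- toupper(as.character(x))
  x <- gsub("^\\s+|\\s+$", "", x)                   # strip leading/trailing whitespace
  x <- gsub("\\s+", " ", x)                         # collapse repeated whitespace
  x <- sub("^TSTM WIND$", "THUNDERSTORM WIND", x)   # assumed synonym rule
  x
}

# Made-up rows with three spellings of the same event type
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "Thunderstorm Wind", " THUNDERSTORM  WIND", "HAIL"),
  INJURIES = c(4, 2, 1, 3)
)

toy$EVTYPE <- normalize_evtype(toy$EVTYPE)
merged <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
print(merged)
```

Whether such merging is appropriate depends on the event definitions in the NOAA documentation; the tables in this report use the labels as recorded.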

Top 10 weather events with the most fatalities:

fatalities[order(fatalities$FATALITIES, decreasing = T), ][1:10, ]
##             EVTYPE FATALITIES
## 407        TORNADO       5633
## 61  EXCESSIVE HEAT       1903
## 73     FLASH FLOOD        978
## 151           HEAT        937
## 258      LIGHTNING        816
## 423      TSTM WIND        504
## 86           FLOOD        470
## 306    RIP CURRENT        368
## 200      HIGH WIND        248
## 11       AVALANCHE        224

Top 10 most harmful events with respect to population health:

library(ggplot2)
# Keep the ten event types with the highest casualty totals
top_ten <- casualties[order(casualties$CASUALTIES, decreasing = TRUE), ][1:10, ]
ggplot(top_ten, aes(reorder(EVTYPE, -CASUALTIES), CASUALTIES)) + 
    geom_bar(stat = "identity", fill = "red") + 
    ggtitle("Top 10 Most Harmful Storm Events in the US") + 
    ylab("Number of Casualties (Injuries and Fatalities)") + 
    theme_bw() + theme(axis.title.x = element_blank())

Figure: bar chart of the ten event types with the most casualties (injuries plus fatalities), in decreasing order.
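
The ranking behind Question 2 can be listed in the same way as the injuries and fatalities tables. A minimal sketch on a made-up `damages_toy` data frame follows; the dollar figures are hypothetical and are not results of this analysis.

```r
# Hypothetical per-event damage totals in US dollars
damages_toy <- data.frame(
  EVTYPE = c("HAIL", "FLOOD", "TORNADO", "DROUGHT"),
  DMG    = c(2.0e6, 9.0e9, 4.0e8, 1.2e7)
)

# Order by total damage, largest first
ranked <- damages_toy[order(damages_toy$DMG, decreasing = TRUE), ]

# Express totals in billions of dollars for readability
ranked$DMG_BILLIONS <- round(ranked$DMG / 1e9, 3)

# Show the top three rows of the toy ranking
print(head(ranked, 3))
```

Applied to the real damages variable, the same ordering with `head(..., 10)` would give a top 10 table comparable to the ones above.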

Conclusion

Question 1

Tornadoes are the most harmful storm event with respect to population health: they account for 91346 injuries and 5633 fatalities, far more than any other event type.

Question 2

Floods are the most expensive storm event with respect to the economy.