The goal of this data analysis is to answer the following questions:
Question 1
Across the United States, which types of events are most harmful with respect to population health?
Question 2
Across the United States, which types of events have the greatest economic consequences?
Data Processing
Software
The following software and hardware configuration was used to perform this analysis:
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_0.10 stringr_0.6.2 tools_3.1.0
Data Loading
Data for this analysis was sourced from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The data was downloaded and loaded into R as follows.
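# R 3.1.0 on Windows rejected this https URL ("unsupported URL scheme");
# method = "curl" is one workaround where curl is installed.
# mode = "wb" keeps the compressed file intact on Windows.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
    destfile = "stormdata.csv.bz2", mode = "wb")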
data <- read.csv(bzfile("stormdata.csv.bz2"), header = TRUE, sep = ",")
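The download step can be guarded so that the file is fetched only once. The following is a minimal sketch, not the author's method: `fetch_once` is a hypothetical helper, and the demonstration uses a small stand-in file with made-up rows so that it runs without network access.

```r
# Hypothetical helper: download only when the file is not already cached locally
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  destfile
}

# Stand-in bzip2-compressed CSV with made-up rows (not the storm data)
demo_file <- tempfile(fileext = ".csv.bz2")
con <- bzfile(demo_file, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "FLOOD,0"), con)
close(con)

# The file already exists, so no download is attempted and the URL is unused
path <- fetch_once("https://example.invalid/placeholder", demo_file)
demo <- read.csv(bzfile(path), header = TRUE, sep = ",")
print(demo)
```

With the real URL and `destfile = "stormdata.csv.bz2"`, a second knit of the report would reuse the local copy instead of downloading it again.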
The number of attributes in the data set is:
ncol(data)
## [1] 37
The names of the attributes in the data set are:
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The number of rows in the data set is:
nrow(data)
## [1] 902297
Pre-Processing Data
(1) Keep only the records that report at least one injury, fatality, or nonzero property or crop damage, and only the attributes related to population health and economic impact.
data <- subset(x = data, subset = INJURIES > 0 | FATALITIES > 0 | PROPDMG >
0 | CROPDMG > 0, select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
(2) Let's prepare the data to address question 1.
To find which weather event has the most injuries, let's create the injuries variable which adds all injuries in the data$INJURIES column per data$EVTYPE.
injuries <- aggregate(INJURIES ~ EVTYPE, data = data, sum)
To find which weather event has the most fatalities, let's create the fatalities variable which adds all fatalities in the data$FATALITIES column per data$EVTYPE.
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = data, sum)
And to find which weather event is the most harmful, let's create the variable casualties adding data$INJURIES + data$FATALITIES per data$EVTYPE.
data$CASUALTIES <- data$FATALITIES + data$INJURIES
casualties <- aggregate(CASUALTIES ~ EVTYPE, data = data, sum)
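To make the aggregation step concrete, here is a small sketch on made-up records (the `toy` data frame and its values are illustrative assumptions, not rows from the storm database). It follows the same pattern: add a per-record casualty count, sum it within each event type, then sort.

```r
# Made-up records for illustration only (not taken from the storm database)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 0, 3),
  INJURIES   = c(10, 0, 5, 1)
)

# Casualties per record, then summed within each event type
toy$CASUALTIES <- toy$FATALITIES + toy$INJURIES
toy_casualties <- aggregate(CASUALTIES ~ EVTYPE, data = toy, sum)

# Sort so the event type with the most casualties comes first
toy_casualties <- toy_casualties[order(toy_casualties$CASUALTIES, decreasing = TRUE), ]
print(toy_casualties)
```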
(3) For question 2.
Let's calculate the property (data$PROPDMG) and crop (data$CROPDMG) damage per event. The data$PROPDMGEXP and data$CROPDMGEXP fields hold the multiplier for each damage figure, where K, M and B stand for thousands, millions and billions of US dollars. Any other (corrupt or miscoded) multiplier value is ignored.
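multiplier <- c(K = 10^3, M = 10^6, B = 10^9)
# Codes other than K, M or B look up as NA; count that component as 0
prop_mult <- unname(multiplier[as.character(data$PROPDMGEXP)])
crop_mult <- unname(multiplier[as.character(data$CROPDMGEXP)])
data$DMG <- data$PROPDMG * ifelse(is.na(prop_mult), 0, prop_mult) +
    data$CROPDMG * ifelse(is.na(crop_mult), 0, crop_mult)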
To find which weather event has the most expensive damages, let's create the damages variable, which adds all damages in US dollars (data$DMG column) per data$EVTYPE.
damages <- aggregate(DMG ~ EVTYPE, data = data, sum)
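The behaviour of the multiplier lookup is worth checking on a few values. The sketch below uses made-up codes and amounts. Treating unrecognised codes as zero is shown as one possible choice, stated here as an assumption. It matters because a lookup by a name that is not in the vector returns NA, NA propagates through arithmetic, and aggregate() with a formula omits NA rows by default.

```r
multiplier <- c(K = 10^3, M = 10^6, B = 10^9)

# Made-up exponent codes: two valid, one lowercase, one blank
codes  <- c("K", "B", "m", "")
values <- c(25, 1.5, 3, 0)

# Valid codes give a number; unmatched names give NA
looked_up <- unname(multiplier[codes])
print(looked_up)

# NA propagates through the multiplication
print(values * looked_up)

# One possible treatment (an assumption): count unrecognised codes as zero
safe <- ifelse(is.na(looked_up), 0, looked_up)
dollars <- values * safe
print(dollars)
```

Whether lowercase codes such as "m" should instead be read as millions is a separate decision that this sketch does not make.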
Results
Top 10 weather events with the most injuries:
injuries[order(injuries$INJURIES, decreasing = T), ][1:10, ]
## EVTYPE INJURIES
## 407 TORNADO 91346
## 423 TSTM WIND 6957
## 86 FLOOD 6789
## 61 EXCESSIVE HEAT 6525
## 258 LIGHTNING 5230
## 151 HEAT 2100
## 238 ICE STORM 1975
## 73 FLASH FLOOD 1777
## 364 THUNDERSTORM WIND 1488
## 134 HAIL 1361
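The table above lists TSTM WIND and THUNDERSTORM WIND as separate event types. If the two labels are taken to describe the same kind of event (an assumption, not something the data set states), they could be merged before aggregating. A minimal sketch with made-up counts; the mapping and the untidy third label are hypothetical:

```r
# Made-up counts for illustration; the third label is an invented untidy variant
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", "tstm wind ", "HAIL"),
  INJURIES = c(4, 3, 2, 1)
)

# Normalise case and strip leading and trailing whitespace
label <- toupper(gsub("^\\s+|\\s+$", "", as.character(toy$EVTYPE)))

# Hypothetical mapping: treat the abbreviation as the full label
label[label == "TSTM WIND"] <- "THUNDERSTORM WIND"
toy$EVTYPE <- label

merged <- aggregate(INJURIES ~ EVTYPE, data = toy, sum)
print(merged)
```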
Top 10 weather events with the most fatalities:
fatalities[order(fatalities$FATALITIES, decreasing = T), ][1:10, ]
## EVTYPE FATALITIES
## 407 TORNADO 5633
## 61 EXCESSIVE HEAT 1903
## 73 FLASH FLOOD 978
## 151 HEAT 937
## 258 LIGHTNING 816
## 423 TSTM WIND 504
## 86 FLOOD 470
## 306 RIP CURRENT 368
## 200 HIGH WIND 248
## 11 AVALANCHE 224
Top 10 most harmful events with respect to population health:
library(ggplot2)
top_ten <- casualties[order(casualties$CASUALTIES, decreasing = TRUE), ][1:10, ]
ggplot(top_ten, aes(reorder(EVTYPE, -CASUALTIES), CASUALTIES)) +
    geom_bar(stat = "identity", fill = "red") +
    ggtitle("Top 10 Most Harmful Storm Events in the US") +
    ylab("Number of Casualties (Injuries and Fatalities)") +
    theme_bw() + theme(axis.title.x = element_blank())
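The damages variable prepared for question 2 can be ranked with the same ordering idiom. A minimal sketch, using placeholder event names and totals that are not results from the storm data:

```r
# Placeholder totals for illustration only (not results from the storm data)
damages <- data.frame(
  EVTYPE = c("EVENT A", "EVENT B", "EVENT C"),
  DMG    = c(2.5e9, 7.0e10, 4.0e8)
)

# Order by total damage, largest first, and keep at most the top 10 rows
ranked <- damages[order(damages$DMG, decreasing = TRUE), ]
top_damages <- head(ranked, 10)

# Express the totals in billions of US dollars for readability
top_damages$DMG_BILLIONS <- top_damages$DMG / 1e9
print(top_damages)
```

Applied to the real damages data frame, the first row of such a ranking is the event type referred to in the conclusion for question 2.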
Conclusion
Question 1
Tornadoes are the most harmful storm event with respect to population health.
Question 2
Floods are the most expensive storm event with respect to economic consequences.