Synopsis: This analysis identifies which types of storm events (tornadoes, floods, hail, etc.) have been most harmful to population health and have caused the most property and crop damage in the US from 1950 through November 2011. The data come from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The analysis addresses two questions: which event types are most harmful to population health, and which have the greatest economic consequences?
The results show that tornadoes are the most harmful storm event for both population health and the economy.
Before investigating, check the R environment.
print(sessionInfo())
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
For grouping and summarizing, I will use the dplyr package.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
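Download the compressed Storm Data file from the internet and load it into R. read.csv() reads the bzip2-compressed file directly through a bzfile() connection, so no separate unzip step is needed; the local copy is named dataset.zip here, but bzfile() does not depend on the file extension. Note that the call below does not set strip.white = TRUE, so leading and trailing white space in the fields is kept as read.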
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "dataset.zip")
dataset <- read.csv(bzfile("dataset.zip"), header = TRUE)
dim(dataset)
## [1] 902297 37
names(dataset)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
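Because the compressed file is large, re-running the report downloads it again each time. A minimal sketch of one way to avoid that, assuming a small helper that skips the download when a local copy already exists; the name download_if_missing is hypothetical and not part of the original analysis:

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when no local copy exists yet, and return the local path.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed file in binary mode on every platform
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Intended use with the names from the analysis above (left commented out so
# that running this sketch does not trigger a large download):
# path <- download_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "dataset.zip")
# dataset <- read.csv(bzfile(path), header = TRUE)
```

The helper only checks that the file exists; it does not verify that an earlier download completed, so a partial file would have to be deleted by hand.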
Of these columns, the analysis focuses on EVTYPE, FATALITIES, INJURIES, CROPDMG, and PROPDMG. FATALITIES and INJURIES measure the effect on population health, while CROPDMG and PROPDMG measure the economic effect.
q1.df <- dataset %>% group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES)) %>%
  mutate(harmfulness = fatalities + injuries) %>% arrange(desc(harmfulness))
head(q1.df, 5)
## Source: local data frame [5 x 4]
##
## EVTYPE fatalities injuries harmfulness
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
barplot(q1.df$harmfulness[1:5], names.arg = q1.df$EVTYPE[1:5],
        xlab = "Event Type", ylab = "Fatalities + Injuries (persons)",
        main = "Top 5 Most Harmful Storm Events (Health)")
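The grouping and ranking used for the health totals can be checked by hand on a tiny data frame. The sketch below applies the same dplyr steps to made-up records; the toy values and the names toy and toy_summary are illustrative assumptions, not NOAA data:

```r
library(dplyr)

# Made-up toy records, for illustration only (not taken from the NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(2, 1, 3, 0, 0),
  INJURIES   = c(10, 4, 20, 1, 2),
  stringsAsFactors = FALSE
)

# Same steps as q1.df: sum per event type, add the two totals, sort descending
toy_summary <- toy %>%
  group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES)) %>%
  mutate(harmfulness = fatalities + injuries) %>%
  arrange(desc(harmfulness))

print(toy_summary)
```

For these toy rows the harmfulness totals are 35 for TORNADO (5 fatalities plus 30 injuries), 7 for FLOOD, and 1 for HAIL, in that order.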
q2.df <- dataset %>% group_by(EVTYPE) %>%
  summarise(crop = sum(CROPDMG), property = sum(PROPDMG)) %>%
  mutate(econ_dmg = crop + property) %>% arrange(desc(econ_dmg))
head(q2.df, 5)
## Source: local data frame [5 x 4]
##
## EVTYPE crop property econ_dmg
## 1 TORNADO 100018.5 3212258.2 3312277
## 2 FLASH FLOOD 179200.5 1420124.6 1599325
## 3 TSTM WIND 109202.6 1335965.6 1445168
## 4 HAIL 579596.3 688693.4 1268290
## 5 FLOOD 168037.9 899938.5 1067976
barplot(q2.df$econ_dmg[1:5], names.arg = q2.df$EVTYPE[1:5],
xlab = "Event Type", ylab = "Crop + Property($)",
main = "Top 5 Most Harmful Storm Events (Economic)")
The figure above shows that tornadoes are also the most harmful storm event in economic terms.
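The economic totals above add the raw PROPDMG and CROPDMG values. The dataset also contains the PROPDMGEXP and CROPDMGEXP columns listed in names(dataset), which NOAA's documentation describes as magnitude codes for the damage figures. A minimal sketch of how such codes could be applied, assuming K, M, and B stand for thousands, millions, and billions and treating every other code as a multiplier of 1; the helper exp_multiplier and the toy data are hypothetical, and this sketch says nothing about how the real ranking would come out:

```r
# Assumed mapping from exponent code to multiplier; codes not listed here
# (blank, digits, "+", "-", "?" and so on) are treated as 1 in this sketch.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]  # unknown codes index to NA
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up toy records, for illustration only (not taken from the NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  PROPDMG    = c(25, 1.5, 2, 10),
  PROPDMGEXP = c("K", "M", "k", ""),
  stringsAsFactors = FALSE
)

# Scale each raw value by its multiplier, then total per event type
toy$prop_dollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
totals <- aggregate(prop_dollars ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$prop_dollars), ]
print(totals)
```

The same helper could be applied to CROPDMG and CROPDMGEXP before summing crop and property damage, if dollar-scaled totals are wanted.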