Storms and other severe weather events can cause both public health and economic problems, resulting in loss of life, injuries, significant property damage, and/or disruption to commerce. In this report we explore which types of weather events cause the greatest harm to public health and to the economy across the United States. We obtained the data from the Coursera course site; it originates from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Our analysis found that tornadoes pose the biggest threat to public health, whereas flash floods cause the greatest damage to the economy.
We first set up RStudio and load the packages required for the data analysis.
1.1 Set working directory
# setwd("C:/Users/Angashley/Desktop/CourseraR learning/Reproducible Research/Week4/StormData")
# personal details edited out
1.2 Load packages to be used
library(dplyr)
library(reshape2)
library(ggplot2)
library(knitr)
1.3 Set global options
opts_chunk$set(echo = TRUE, fig.width=12, fig.height=8, fig.path='Figs/', cache=TRUE)
We download the StormData.csv.bz2 data file that comes in the form of a comma-separated-value file compressed via the bzip2 algorithm. The read.csv() command is used to read in the data.
bzip2url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("StormData.csv.bz2")){
download.file(bzip2url,
destfile = "StormData.csv.bz2")}
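# stringsAsFactors = TRUE keeps the factor columns that the later levels() calls rely on
# (it is no longer the default since R 4.0)
StormData <- read.csv("StormData.csv.bz2", stringsAsFactors = TRUE)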
In the early years of the Storm Events Database, only events such as Tornado, Thunderstorm Wind and Hail were recorded. Since January 1996, all 48 event types have been recorded. To compare weather events fairly, we restrict the analysis to the years in which all event types are recorded. The rows and columns we use are:
BGN_DATE (> 1995/12/31)
EVTYPE
FATALITIES
INJURIES
PROPDMG and PROPDMGEXP
CROPDMG and CROPDMGEXP
The following code does three things: 1) it subsets the columns of interest; 2) it selects records dated after Dec. 31, 1995; and 3) it removes records with no fatalities, no injuries, and no damage to property or crops.
stormData <- as_tibble(StormData) %>%  # as_tibble() replaces the deprecated tbl_df()
select(BGN_DATE, EVTYPE, FATALITIES,
INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP) %>%
rename(DATE = BGN_DATE)
stormData$DATE <- format(as.POSIXct(stormData$DATE,
format="%m/%d/%Y"),
format="%Y/%m/%d")
stormdata <- stormData %>% filter(as.Date(DATE) >
as.Date("1995/12/31"))
stormdat <- stormdata %>% filter(FATALITIES > 0 |
INJURIES > 0 |
PROPDMG > 0 |
CROPDMG > 0)
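The date step can be illustrated on a few toy values. This is a minimal sketch in base R, assuming the raw BGN_DATE strings look like "1/1/1996 0:00:00" (the trailing time part is an assumption about the file; as.Date() ignores trailing characters beyond the given format). The names raw_dates, parsed and keep are hypothetical.

```r
# Toy BGN_DATE values in the assumed month/day/year layout
raw_dates <- c("4/18/1950 0:00:00", "12/31/1995 0:00:00",
               "1/1/1996 0:00:00", "8/29/2005 0:00:00")

# Parse straight to Date; characters after the matched format are ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Keep only records strictly after Dec. 31, 1995
keep <- parsed > as.Date("1995-12-31")
raw_dates[keep]
```

Parsing directly to Date avoids the intermediate formatted string, but the result of the comparison should match the POSIXct round trip used above.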
EVTYPE
We examine the event types in the variable EVTYPE and can see that they are not consistently recorded: there are extra spaces, inconsistent casing and inconsistent event naming (with or without spaces, singular or plural forms).
events <- levels(stormdat$EVTYPE)
head(events,20)
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
The following code converts EVTYPE to uppercase; removes spaces, scale indicators (e.g. G45, G40), digits and punctuation such as brackets; and strips the trailing ‘S’ from plurals.
levels(stormdat$EVTYPE) <- toupper(levels(stormdat$EVTYPE))
levels(stormdat$EVTYPE) <- gsub(pattern = " |G[0-9]+|\\d+|[[:punct:]]",
replacement = "",
levels(stormdat$EVTYPE))
levels(stormdat$EVTYPE) <- gsub(pattern = "(.*)S$",
replacement = "\\1",
levels(stormdat$EVTYPE))
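To see what these patterns do, the same three steps can be applied to a handful of strings. This is a small sketch; the helper name clean_evtype and the sample labels are hypothetical and only serve as illustration.

```r
# Apply the same three cleaning steps to a plain character vector
clean_evtype <- function(x) {
  x <- toupper(x)
  # drop spaces, gust indicators such as G45, digits and punctuation
  x <- gsub(pattern = " |G[0-9]+|\\d+|[[:punct:]]", replacement = "", x)
  # strip a single trailing S (plural forms)
  gsub(pattern = "(.*)S$", replacement = "\\1", x)
}

samples <- c(" TSTM WIND (G45)", "Rip Currents", "FLASH FLOOD", "Thunderstorm Winds")
clean_evtype(samples)
# "TSTMWIND" "RIPCURRENT" "FLASHFLOOD" "THUNDERSTORMWIND"
```

These patterns do not expand abbreviations, so TSTMWIND and THUNDERSTORMWIND remain separate event types in the tables that follow.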
PROPDMGEXP and CROPDMGEXP
The following code converts the exponent codes in PROPDMGEXP and CROPDMGEXP to the corresponding powers of ten, e.g. B to 9, K to 3 and M to 6. The empty string becomes 0.
levels(stormdat$PROPDMGEXP)
## [1] "" "B" "K" "M"
levels(stormdat$CROPDMGEXP)
## [1] "" "B" "K" "M"
levels(stormdat$PROPDMGEXP) <- c(0,9,3,6)
levels(stormdat$CROPDMGEXP) <- c(0,9,3,6)
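One point worth checking when the recoded exponents are used in arithmetic: in R, as.numeric() applied to a factor returns the internal integer codes (1, 2, 3, ...), not the level labels. If the columns are still factors when an expression such as 10^as.numeric(PROPDMGEXP) is evaluated, the exponents would be 1 to 4 rather than 0, 9, 3 and 6. The sketch below uses toy data (exp_codes and dmg are hypothetical names) to show the difference and a conversion through as.character().

```r
# Toy exponent factor with the same four levels as in the data
exp_codes <- factor(c("K", "M", "", "B"), levels = c("", "B", "K", "M"))
levels(exp_codes) <- c(0, 9, 3, 6)

as.numeric(exp_codes)                # integer codes: 3 4 1 2
as.numeric(as.character(exp_codes))  # level labels:  3 6 0 9

# Dollar amounts computed from the labels rather than the codes
dmg <- c(25, 2.5, 10, 1)
damage <- dmg * 10^as.numeric(as.character(exp_codes))
damage
# 25000 2500000 10 1000000000
```

Comparing totals computed both ways is a quick check of whether the damage figures reflect the intended exponents.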
Which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
1.1 List the top 10 events that cause the greatest loss of life. We can see that the deadliest event is Excessive Heat. Tornado and Flash Flood rank second and third in number of fatalities.
fatalities <- stormdat %>% group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES))
head(arrange(fatalities, desc(Fatalities)), n = 10)
## # A tibble: 10 x 2
## EVTYPE Fatalities
## <fctr> <int>
## 1 EXCESSIVEHEAT 1797
## 2 TORNADO 1511
## 3 FLASHFLOOD 887
## 4 LIGHTNING 651
## 5 RIPCURRENT 542
## 6 FLOOD 414
## 7 TSTMWIND 242
## 8 HEAT 237
## 9 HIGHWIND 235
## 10 AVALANCHE 223
1.2 List the top 10 events that cause the most injuries. We can see that the event causing the most injuries is Tornado. Flood and Excessive Heat rank second and third.
injuries <- stormdat %>% group_by(EVTYPE) %>%
summarise(Injuries = sum(INJURIES))
head(arrange(injuries, desc(Injuries)), n = 10)
## # A tibble: 10 x 2
## EVTYPE Injuries
## <fctr> <int>
## 1 TORNADO 20667
## 2 FLOOD 6758
## 3 EXCESSIVEHEAT 6391
## 4 LIGHTNING 4141
## 5 TSTMWIND 3633
## 6 FLASHFLOOD 1674
## 7 THUNDERSTORMWIND 1400
## 8 WINTERSTORM 1292
## 9 HURRICANETYPHOON 1275
## 10 HEAT 1222
1.3 Plot a bar chart showing the top 10 events that cause the greatest harm to population health overall.
harmHealth <- stormdat %>% group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES),
Injuries = sum(INJURIES),
damageTotal = sum(FATALITIES,INJURIES)) %>%
arrange(desc(damageTotal))
harmHealth0 <- harmHealth[1:10,] %>% melt(id.vars = "EVTYPE",
measure.vars = c("Fatalities", "Injuries"),
variable.name = "damageType",
value.name = "damageCount",
factorsAsStrings = TRUE)
ggplot(data = harmHealth0, aes(x = reorder(EVTYPE, damageCount), y = damageCount, fill = damageType)) + geom_bar(stat="identity") +
coord_flip() + labs(fill = "", x = "Storm Events", y = "Population",
title = "Top 10 Storm Events that Cause Damage to Health") +
theme(text = element_text(size = 15),
plot.title = element_text(colour = "blue"))
From the plot above, we can see that Tornado poses the biggest threat to overall population health; Excessive Heat and Flood rank second and third.
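A note on the bar ordering: reorder() ranks the levels by a summary of the second argument, and its default summary function is mean. Because each event contributes exactly two rows to the long table (Fatalities and Injuries), ordering by the mean gives the same order as ordering by the total. The sketch below, on toy numbers with hypothetical names, makes the total explicit with FUN = sum.

```r
# Toy long-format table: two rows (Fatalities, Injuries) per event
long <- data.frame(
  EVTYPE      = c("TORNADO", "TORNADO", "FLOOD", "FLOOD", "HEAT", "HEAT"),
  damageType  = rep(c("Fatalities", "Injuries"), times = 3),
  damageCount = c(10, 200, 5, 60, 30, 90)
)

# Order event levels by their combined count (ascending)
by_total <- reorder(factor(long$EVTYPE), long$damageCount, FUN = sum)
levels(by_total)
# "FLOOD" "HEAT" "TORNADO"

# The default (mean) yields the same order here because every event has two rows
by_mean <- reorder(factor(long$EVTYPE), long$damageCount)
identical(levels(by_total), levels(by_mean))
# TRUE
```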
2.1 List the top 10 events that cause the greatest damage to property. We can see that the most destructive event is Flash Flood. Thunderstorm Wind and Tornado are the second and third most damaging events for property.
harmProperty <- stormdat %>% group_by(EVTYPE) %>%
summarise(harmTotal = sum(PROPDMG*10^as.numeric(PROPDMGEXP)))
head(arrange(harmProperty,desc(harmTotal)), n = 10)
## # A tibble: 10 x 2
## EVTYPE harmTotal
## <fctr> <dbl>
## 1 FLASHFLOOD 1364500310
## 2 TSTMWIND 1359600640
## 3 TORNADO 1351198440
## 4 FLOOD 1010592400
## 5 THUNDERSTORMWIND 884963640
## 6 HAIL 685404200
## 7 LIGHTNING 490854780
## 8 HIGHWIND 348342490
## 9 WINTERSTORM 139575650
## 10 WILDFIRE 115760104
2.2 List the top 10 events that cause the greatest damage to crops. We can see that the most destructive event is Hail. Flood and Flash Flood are the second and third most damaging events for crops.
harmCrop <- stormdat %>% group_by(EVTYPE) %>%
summarise(harmTotal = sum(CROPDMG*10^as.numeric(CROPDMGEXP)))
head(arrange(harmCrop,desc(harmTotal)), n = 10)
## # A tibble: 10 x 2
## EVTYPE harmTotal
## <fctr> <dbl>
## 1 HAIL 516156150
## 2 FLOOD 195276200
## 3 FLASHFLOOD 171641800
## 4 DROUGHT 144412300
## 5 TSTMWIND 113117850
## 6 TORNADO 91869910
## 7 THUNDERSTORMWIND 69651000
## 8 HURRICANE 29493100
## 9 HIGHWIND 22820400
## 10 HEAVYRAIN 17438900
2.3 Plot a bar chart showing the top 10 events that cause the greatest economic damage overall.
harmEconomy <- stormdat %>% group_by(EVTYPE) %>%
summarise(Properties = sum(PROPDMG*10^as.numeric(PROPDMGEXP)),
Crops = sum(CROPDMG*10^as.numeric(CROPDMGEXP)),
damageTotal = sum(PROPDMG*10^as.numeric(PROPDMGEXP),
CROPDMG*10^as.numeric(CROPDMGEXP))) %>%
arrange(desc(damageTotal))
harmEconomy0 <- harmEconomy[1:10,] %>% melt(id.vars = "EVTYPE",
measure.vars = c("Properties", "Crops"),
variable.name = "damageType",
value.name = "damageCost",
factorsAsStrings = TRUE)
ggplot(data = harmEconomy0, aes(x = reorder(EVTYPE, damageCost), y = damageCost, fill = damageType)) + geom_bar(stat="identity") +
coord_flip() + labs(fill = "", x = "Storm Events", y = "Cost (US dollars)",
title = "Top 10 Storm Events that Cause Damage to Economy") +
theme(text = element_text(size = 15),
plot.title = element_text(colour = "blue"))
From the plot above, we can see that Flash Flood causes the largest overall economic cost; Thunderstorm Wind and Tornado rank second and third.