This report identifies which types of severe weather events are the major causes of public health and economic problems across the U.S. The main results are:
For public health, tornadoes caused the most overall harm to population health across the U.S. during 1996-2011, while excessive heat caused the most fatalities. Floods, thunderstorm winds, lightning and flash floods also did considerable harm to population health.
For economic consequences, floods caused the greatest economic damage across the U.S. during 1996-2011, while drought caused the most crop damage. Hurricanes/typhoons, storm surges, tornadoes, hail and flash floods also caused considerable economic damage.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
By exploring the NOAA Storm Database, this report answers the following questions about severe weather events:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this report come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size; the download URL appears in the code below. To speed up loading, the ‘bunzip2’ function from the ‘R.utils’ package decompresses the file and ‘fread’ from the ‘data.table’ package reads it. Using ‘fread’ also avoids the wrong row count that CRLF line endings (EOLs) can cause on Windows.
library(R.utils)
## Warning: package 'R.utils' was built under R version 3.4.1
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.21.0 (2016-10-30) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.5.0 (2016-11-07) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(data.table)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RColorBrewer)
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# create the data directory, then download and decompress only if needed
if (!file.exists("data")) {
    dir.create("data")
}
if (!file.exists("./data/StormData.csv.bz2")) {
    download.file(URL, "./data/StormData.csv.bz2")
}
if (!file.exists("./data/StormData.csv")) {
    bunzip2("./data/StormData.csv.bz2", remove = FALSE)
}
data <- fread("./data/StormData.csv")
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:14
dim(data)
## [1] 902297 37
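As an aside on the compressed format: base R can also read a bzip2-compressed CSV through a ‘bzfile’ connection, without a separate decompression step. The sketch below shows this on a tiny temporary file with made-up rows (the file and its contents are illustrative assumptions, not the NOAA data). On the full data set this route is likely to be much slower than the ‘bunzip2’ plus ‘fread’ approach used above.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (made-up rows).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "FLOOD,1"), con)
close(con)

# Read it back through a bzfile connection; read.csv opens and closes it.
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(dim(toy))
unlink(tmp)
```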
1. Shrinking the data to make the analysis faster
First, ‘select’ drops the variables that are not needed for this analysis. Then ‘filter’ keeps only the records from 1996 onward, because all event types have been recorded only since January 1996. Since the objective is to compare the losses caused by different weather events, records with no loss at all (no fatalities, injuries, property damage or crop damage) are dropped as well.
sdata <- select(data,c(2,5,7,8,23:28,37))
sdata$BGN_DATE <- as.Date(sdata$BGN_DATE, format = "%m/%d/%Y %T")
sdata <- filter(sdata, year(BGN_DATE) >= 1996)
## Warning: package 'bindrcpp' was built under R version 3.4.1
dim(sdata)
## [1] 653530 11
# drop records with no fatalities, injuries, property damage or crop damage
sdata <- filter(sdata, !(FATALITIES == 0 &
                         INJURIES == 0 &
                         PROPDMG == 0 &
                         CROPDMG == 0))
dim(sdata)
## [1] 201318 11
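The two filters in this step can be illustrated on a small made-up table. The sketch below uses base R only; the rows and the name ‘toy’ are illustrative assumptions, and the date strings are assumed to follow the same layout as the raw file.

```r
# Made-up records in the same shape as the columns used in this step.
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "1/6/1996 0:00:00",
                 "7/4/2003 0:00:00", "8/29/2005 0:00:00"),
  FATALITIES = c(0, 0, 0, 2),
  INJURIES   = c(15, 0, 0, 0),
  PROPDMG    = c(25, 0, 1.5, 0),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Parse the date and keep records from 1996 onward.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
toy <- toy[as.integer(format(toy$BGN_DATE, "%Y")) >= 1996, ]

# Drop records with no recorded loss of any kind.
no_loss <- toy$FATALITIES == 0 & toy$INJURIES == 0 &
  toy$PROPDMG == 0 & toy$CROPDMG == 0
toy <- toy[!no_loss, ]
print(toy)
```

The 1950 record is removed by the date filter, and the 1996 record is removed because it carries no loss.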
2. A glance at the roughly summarised data
A function ‘exchange’ converts the damage exponent codes into numeric multipliers so that the actual damage amounts can be calculated. Then a roughly summarised data frame ‘tdata’ is created to compare the losses caused by different events.
When summarising, fatalities are treated as weighing more heavily on population health than injuries, so the number of injuries is divided by 10 to reduce its weight. The variable ‘HPH’ (harm to population health) equals the number of fatalities plus 1/10 of the number of injuries.
# convert a damage exponent code into its numeric multiplier
exchange <- function(x) {
    if (x == "") {
        x <- 0
    } else if (x == "K") {
        x <- 1000
    } else if (x == "M") {
        x <- 1000000
    } else if (x == "B") {
        x <- 1000000000
    }
    x   # any other code is returned unchanged
}
sdata$PROPDMGEXP <- sapply(sdata$PROPDMGEXP, exchange)
sdata$CROPDMGEXP <- sapply(sdata$CROPDMGEXP, exchange)
# scale the damage figures, then drop the two exponent columns
sdata <- sdata %>%
    mutate(PROPDMG = PROPDMG * PROPDMGEXP,
           CROPDMG = CROPDMG * CROPDMGEXP) %>%
    select(-PROPDMGEXP, -CROPDMGEXP)
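The exponent conversion can also be written as a vectorised lookup, which avoids calling a function once per row. The sketch below is one possible alternative, not the method used in this report: ‘multipliers’ and ‘exp_to_multiplier’ are hypothetical names, and it assumes that any code other than K, M or B (including the empty string) should count as zero damage, whereas ‘exchange’ returns unrecognised codes unchanged.

```r
# Lookup table for the exponent codes handled by 'exchange'.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorised helper (not part of the original analysis).
exp_to_multiplier <- function(codes) {
  m <- unname(multipliers[codes])
  # Assumption: an empty or unrecognised code contributes no damage.
  m[is.na(m)] <- 0
  m
}

codes   <- c("", "K", "M", "B")   # made-up exponent codes
amounts <- c(5, 2.5, 1.2, 3)      # made-up damage figures
print(amounts * exp_to_multiplier(codes))
```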
# total losses per raw event type
tdata <- sdata %>%
    group_by(EVTYPE) %>%
    summarise(Fatalities = sum(FATALITIES),
              Injuries = sum(INJURIES),
              Prop.DMG = sum(PROPDMG),
              Crop.DMG = sum(CROPDMG),
              HPH = sum(FATALITIES) + sum(INJURIES) / 10,
              Dmg = sum(PROPDMG) + sum(CROPDMG)
    )
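As a worked example of the HPH weighting, the sketch below sums fatalities and injuries per event type on a made-up table and then applies HPH = fatalities + injuries / 10. It uses base R ‘aggregate’ rather than the dplyr pipeline above, and the data frame ‘toy’ is an illustrative assumption.

```r
# Made-up event records.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "EXCESSIVE HEAT"),
  FATALITIES = c(2, 0, 5),
  INJURIES   = c(30, 10, 20),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Injuries count for one tenth of a fatality.
totals$HPH <- totals$FATALITIES + totals$INJURIES / 10
print(totals)
```

Here the tornado rows total 2 fatalities and 40 injuries, so their HPH is 2 + 40 / 10 = 6, while excessive heat scores 5 + 20 / 10 = 7.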
This step produces a list of the most influential events: the union of the top 25 event types by HPH and the top 25 by total damage. Only the typos of these roughly 20 event types are then fixed, which saves considerable time.
l1 <- arrange(tdata, desc(HPH))$EVTYPE[1:25]
l2 <- arrange(tdata, desc(Dmg))$EVTYPE[1:25]
# union of the two top-25 lists, sorted alphabetically
major <- sort(unique(c(l1, l2)))
major
## [1] "AVALANCHE" "BLIZZARD"
## [3] "COLD/WIND CHILL" "DROUGHT"
## [5] "EXCESSIVE HEAT" "EXTREME COLD"
## [7] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
## [9] "FLOOD" "FOG"
## [11] "FROST/FREEZE" "HAIL"
## [13] "HEAT" "HEAVY RAIN"
## [15] "HEAVY SNOW" "HIGH SURF"
## [17] "HIGH WIND" "HURRICANE"
## [19] "HURRICANE/TYPHOON" "ICE STORM"
## [21] "LIGHTNING" "RIP CURRENT"
## [23] "RIP CURRENTS" "STORM SURGE"
## [25] "STORM SURGE/TIDE" "STRONG WIND"
## [27] "THUNDERSTORM WIND" "TORNADO"
## [29] "TROPICAL STORM" "TSTM WIND"
## [31] "TYPHOON" "WILD/FOREST FIRE"
## [33] "WILDFIRE" "WINTER STORM"
3. Fixing typos of major events
This step fixes the typos and variant spellings of the most influential events according to the official event type list, after a careful examination of the full list of events in tdata.
EVTYPE <- tolower(tdata$EVTYPE)
istypo <- grepl("^coastal( *)flood|erosion",EVTYPE)
EVTYPE[istypo] <- "coastal flood"
istypo <- grepl("^cold",EVTYPE)
EVTYPE[istypo] <- "cold/wind chill"
istypo <- grepl("extended cold|extreme cold(.*)|extreme windchill|unseasonabl. cold",EVTYPE)
EVTYPE[istypo] <- "extreme cold"
istypo <- grepl("flash flood|flash/flood",EVTYPE)
EVTYPE[istypo] <- "flash flood"
istypo <- grepl("^flood|fld|river flood",EVTYPE)
EVTYPE[istypo] <- "flood"
istypo <- grepl("^fog|dense fog",EVTYPE)
EVTYPE[istypo] <- "fog"
istypo <- grepl("frost|freeze",EVTYPE)
EVTYPE[istypo] <- "frost/freeze"
istypo <- grepl("heavy rain",EVTYPE)
EVTYPE[istypo] <- "heavy rain"
istypo <- grepl("high surf|heavy surf",EVTYPE)
EVTYPE[istypo] <- "high surf"
istypo <- grepl("hurricane|typhoon",EVTYPE)
EVTYPE[istypo] <- "hurricane/typhoon"
istypo <- grepl("slide|slump",EVTYPE)
EVTYPE[istypo] <- "landslide"
istypo <- grepl("rip current",EVTYPE)
EVTYPE[istypo] <- "rip current"
istypo <- grepl("storm surge",EVTYPE)
EVTYPE[istypo] <- "storm surge"
istypo <- grepl("strong wind",EVTYPE)
EVTYPE[istypo] <- "strong wind"
istypo <- grepl("^ ?tstm wind|thunderstorm wind",EVTYPE)
EVTYPE[istypo] <- "thunderstorm wind"
istypo <- grepl("fire",EVTYPE)
EVTYPE[istypo] <- "wildfire"
tdata$EVTYPE <- EVTYPE
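The relabelling above applies its rules in a fixed order, and the order matters: ‘flash flood’ variants must be caught before the broader ‘flood’ rule. The same idea can be sketched as a table of ordered rules applied in a loop. The helper ‘normalise_evtype’, the ‘rules’ table and the sample names below are hypothetical, and only a subset of the patterns above is included.

```r
# Ordered rules: pattern -> official label (a subset of the rules above).
rules <- data.frame(
  pattern = c("flash flood|flash/flood",
              "^flood|fld|river flood",
              "hurricane|typhoon",
              "^ ?tstm wind|thunderstorm wind",
              "fire"),
  label   = c("flash flood", "flood", "hurricane/typhoon",
              "thunderstorm wind", "wildfire"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each rule in turn to lower-cased names.
normalise_evtype <- function(evtype, rules) {
  out <- tolower(evtype)
  for (i in seq_len(nrow(rules))) {
    out[grepl(rules$pattern[i], out)] <- rules$label[i]
  }
  out
}

# Made-up sample of raw event names.
raw <- c("TSTM WIND", " TSTM WIND (G45)", "FLASH FLOOD/FLOOD",
         "RIVER FLOODING", "WILD/FOREST FIRE", "TYPHOON")
print(normalise_evtype(raw, rules))
```

Keeping the rules in a table makes the order explicit and lets new variants be added without repeating the assignment code.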
4. Summarise the data
# re-aggregate, since several raw labels now share one corrected name
fdata <- tdata %>%
    group_by(EVTYPE) %>%
    summarise(Fatalities = sum(Fatalities),
              Injuries = sum(Injuries),
              HPH = sum(HPH),
              Prop.DMG = sum(Prop.DMG),
              Crop.DMG = sum(Crop.DMG),
              Dmg = sum(Dmg))
1. Which types of events are most harmful to population health?
The 15 event types that caused the most harm to population health are shown in the following figure:
nrows <- 15
par(mar = c(1, 1, 6.5, 2), mfrow = c(1, 1), cex = 0.75)
# rank event types by HPH; reverse so the largest bar is drawn at the top
fdata <- arrange(fdata, desc(HPH))
x <- rev(fdata$EVTYPE[1:nrows])
y1 <- rev(fdata$HPH[1:nrows])
y2 <- rev(fdata$Fatalities[1:nrows])
# wide bars: HPH; narrower overlaid bars: fatalities
barplot(y1, horiz = TRUE, xlim = c(-1400, 4000), axes = FALSE,
        col = rep(brewer.pal(9, "Reds")[2:6], each = nrows/5))
barplot(y2, horiz = TRUE, xlim = c(-1400, 4000), axes = FALSE, width = 0.8, space = 0.5,
        col = rep(brewer.pal(9, "BuGn")[3:7], each = 4), add = TRUE)
# event names on the left, HPH values on the right, fatality counts for the top 7
text(seq(from = 0.7, length.out = nrows, by = 1.2), x = 0, labels = x, pos = 2)
text(seq(from = 0.7, length.out = nrows, by = 1.2), x = y1, labels = round(y1), pos = 4)
text(seq(from = 0.7, length.out = nrows, by = 1.2), x = y2 + 50,
     labels = rev(c(fdata$Fatalities[1:7], rep("", nrows - 7))), pos = 2)
axis(3, c(0, 1000, 2000, 3000, 4000), c('0', '1000', '2000', '3000', '4000'))
title(main = "Harm to population health of different events")
legend(2000, 4.5, title = "Legend",
       legend = c("Fatalities", "Fatalities + Injuries/10"),
       fill = c(brewer.pal(9, "BuGn")[5], brewer.pal(9, "Reds")[5]))
# share of total HPH caused by the top 7 event types
ratio1 <- paste(round(sum(fdata$HPH[1:7])/sum(fdata$HPH)*100), "%", sep = "")
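The label positions in the plotting code are hard-coded as a sequence starting at 0.7 in steps of 1.2, which matches the default bar width of 1 and spacing of 0.2. As a possible alternative, ‘barplot’ returns the bar midpoints, and these can be passed to ‘text’ directly. The sketch below uses made-up values and draws on a null PDF device so that it runs without opening a window; ‘labels’, ‘values’ and ‘mids’ are illustrative names.

```r
labels <- c("flood", "heat", "tornado")   # made-up event names
values <- c(3, 5, 2)                      # made-up totals

pdf(NULL)   # null graphics device: nothing is written to disk
# barplot returns the midpoint of each bar along the category axis.
mids <- barplot(values, horiz = TRUE, axes = FALSE, xlim = c(-2, 7))
text(x = 0, y = mids, labels = labels, pos = 2)        # names left of the bars
text(x = values, y = mids, labels = values, pos = 4)   # values at the bar ends
invisible(dev.off())

print(as.numeric(mids))
```

With this approach the labels stay aligned if the number of bars, their width or their spacing changes.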
From this figure it can be inferred that tornadoes caused the most harm to population health across the U.S. during 1996-2011, while excessive heat caused the most fatalities. Floods, lightning, flash floods, thunderstorm winds and rip currents also did considerable harm to population health. Together, these 7 event types caused 74% of the HPH across all events.
2. Which types of events have the greatest economic consequences?
The 15 event types that caused the greatest economic damage (property plus crop damage) are shown in the following figure:
nrows <- 15
par(mar = c(1, 1, 6.5, 2), mfrow = c(1, 1), cex = 0.75)
# rank event types by total damage; values are converted to billions of dollars
fdata <- arrange(fdata, desc(Dmg))
x <- rev(fdata$EVTYPE[1:nrows])
y1 <- rev(fdata$Dmg[1:nrows])/1000000000
y2 <- rev(fdata$Crop.DMG[1:nrows])/1000000000
# wide bars: total damage; narrower overlaid bars: crop damage
barplot(y1, horiz = TRUE, xlim = c(-60, 170), axes = FALSE,
        col = rep(brewer.pal(9, "Reds")[2:6], each = nrows/5))
barplot(y2, horiz = TRUE, xlim = c(-60, 170), axes = FALSE, width = 0.8, space = 0.5,
        col = rep(brewer.pal(9, "BuGn")[3:7], each = 4), add = TRUE)
# event names on the left, total damage on the right, crop damage for the top 6
text(seq(from = 0.7, length.out = nrows, by = 1.2), x = 0, labels = x, pos = 2)
text(seq(from = 0.7, length.out = nrows, by = 1.2), x = y1, labels = round(y1, 1), pos = 4)
text(seq(from = 0.7, length.out = nrows, by = 1.2), x = y2 - 1.5,
     labels = rev(c(round(fdata$Crop.DMG[1:6]/1000000000, 1), rep("", nrows - 6))), pos = 4)
# crop damage label for the 7th-ranked event, placed to the left of the bar end
text(0.7 + 1.2 * (nrows - 7), x = y2[nrows - 6] + 1.5, labels = round(y2[nrows - 6], 1), pos = 2)
axis(3, c(0, 40, 80, 120, 160), c('0', '40', '80', '120', '160'))
title(main = "Economic damage (billion $) of different events")
legend(85, 4.5, title = "Legend",
       legend = c("Crop.DMG", "Crop.DMG + Prop.DMG"),
       fill = c(brewer.pal(9, "BuGn")[5], brewer.pal(9, "Reds")[5]))
# share of total damage caused by the top 6 event types
ratio2 <- paste(round(sum(fdata$Dmg[1:6])/sum(fdata$Dmg)*100), "%", sep = "")
From this figure it can be inferred that floods caused the greatest economic damage across the U.S. during 1996-2011. Drought is also notable because it caused the most crop damage. Hurricanes/typhoons, storm surges, tornadoes, hail and flash floods also caused considerable economic damage. These 7 types of events caused 85% of the economic damage across all events.
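The percentages quoted in the results are the combined total of the top-ranked event types divided by the total over all event types. A minimal sketch of that calculation, using a hypothetical helper ‘top_share’ and made-up totals, is shown below; the value of ‘k’ should match the number of event types being described.

```r
# Hypothetical helper: share of the total held by the k largest values.
top_share <- function(values, k) {
  ranked <- sort(values, decreasing = TRUE)
  sum(ranked[seq_len(k)]) / sum(ranked)
}

damage <- c(50, 30, 10, 6, 4)   # made-up totals, one per event type
share  <- paste0(round(top_share(damage, 2) * 100), "%")
print(share)
```

In this toy case the two largest totals are 50 and 30 out of 100, so the share is 80%.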