Severe weather has serious detrimental consequences for human health and the economy. This report identifies the top 10 types of events with the highest human fatalities, the highest human injuries and the highest economic consequences (combined property damage and crop damage). I used the data provided by the National Oceanic and Atmospheric Administration (NOAA) documenting severe weather events from 1950 to 2011. Since event types were not consistently labeled (for each event type there were multiple non-matching entries) and could not be used for a reliable analysis as given, I tidied up the data frame by using regular expressions. Plots of the top event types for each measure showed that tornadoes are the most harmful whether measured by the number of human fatalities or by the number of injuries. Flood was identified as the event type with the highest economic consequences.
First, setting things up (clearing the environment, setting the working directory, loading the required packages).
# setting things up
rm(list = ls())
setwd("/home/pavel/Documents/classes/dataScSpec/reprRes/Assignment2")
require(ggplot2)
## Loading required package: ggplot2
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Downloading and reading the file with the data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("repdata%2Fdata%2FStormData.csv.bz2")) {
  download.file(url, destfile = "repdata%2Fdata%2FStormData.csv.bz2")
}
# reading the file
d.raw <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
Since the EVTYPE, PROPDMGEXP and CROPDMGEXP columns contain mixed uppercase and lowercase letters, let's create a data frame where these columns are lowercase to simplify further analysis.
d.low <- d.raw
d.low$EVTYPE <- tolower(d.low$EVTYPE)
d.low$PROPDMGEXP <- tolower(d.low$PROPDMGEXP)
d.low$CROPDMGEXP <- tolower(d.low$CROPDMGEXP)
The values in the event type column are not ready for analysis, since each type of event is recorded under multiple non-matching labels. For example, for the event “tornado” the values include “tornado”, “tornados”, “tornadoes” etc. Here I am attempting to tidy up the data frame so that it can be analysed by event type. The order of the replacements matters, because later patterns are applied to the labels already rewritten by earlier ones.
d.low$EVTYPE[grep("torna", d.low$EVTYPE)] <- "tornado"
d.low$EVTYPE[grep("heat", d.low$EVTYPE)] <- "excessive heat"
d.low$EVTYPE[grep("flash.*flood.*|flood.*flash.*", d.low$EVTYPE)] <- "flash flood"
d.low$EVTYPE[grep("^t(.)+wind(.)*", d.low$EVTYPE)] <- "thunderstorm wind"
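# a plain "flood" pattern also matches "flash flood", so keep the labels assigned above
d.low$EVTYPE[grepl("flood", d.low$EVTYPE) & d.low$EVTYPE != "flash flood"] <- "flood"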
d.low$EVTYPE[grep("lightn", d.low$EVTYPE)] <- "lightning"
d.low$EVTYPE[grep("rip", d.low$EVTYPE)] <- "rip current"
d.low$EVTYPE[grep("high.*wind.*", d.low$EVTYPE)] <- "high wind"
d.low$EVTYPE[grep("avala", d.low$EVTYPE)] <- "avalanche"
d.low$EVTYPE[grep("winter.*storm.*", d.low$EVTYPE)] <- "winter storm"
# "&" is a literal character in a regular expression, not a logical AND
d.low$EVTYPE[grep("extreme|severe.*cold", d.low$EVTYPE)] <- "extreme cold"
d.low$EVTYPE[grep("heavy.*snow.*", d.low$EVTYPE)] <- "heavy snow"
d.low$EVTYPE[grep("ice.*storm.*", d.low$EVTYPE)] <- "ice storm"
d.low$EVTYPE[grep("hail", d.low$EVTYPE)] <- "hail"
d.low$EVTYPE[grep("hurricane|typhoon", d.low$EVTYPE)] <- "hurricane/typhoon"
d.low$EVTYPE[grep("fire", d.low$EVTYPE)] <- "wild fire"
d.low$EVTYPE[grep("blizzard", d.low$EVTYPE)] <- "blizzard"
d.low$EVTYPE[grep("storm surge", d.low$EVTYPE)] <- "storm surge"
d.low$EVTYPE[grep("tropical storm", d.low$EVTYPE)] <- "tropical storm"
d.low$EVTYPE[grep("drought", d.low$EVTYPE)] <- "drought"
d.low$EVTYPE[grep("frost.*freeze.*", d.low$EVTYPE)] <- "frost/freeze"
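The order-dependent relabeling described above can be tried on a small vector before running it on the full dataset. This is only a minimal sketch: the labels in `ev` are made up for illustration and the vector name is hypothetical. It also shows one pitfall of this approach, namely that a plain "flood" pattern matches "flash flood" as well, so those labels have to be excluded explicitly.

```r
# Toy labels (hypothetical, not taken from the NOAA file)
ev <- tolower(c("TORNADO", "Tornadoes", "FLASH FLOODING", "River Flood", "TSTM WIND"))

# Overwrite every label that matches a pattern with one canonical name
ev[grep("torna", ev)] <- "tornado"
ev[grep("flash.*flood", ev)] <- "flash flood"
# Skip labels that mention "flash" so they are not overwritten with "flood"
ev[grepl("flood", ev) & !grepl("flash", ev)] <- "flood"
ev[grep("^t.+wind", ev)] <- "thunderstorm wind"

# Frequency table of the cleaned labels, most common first
label_counts <- sort(table(ev), decreasing = TRUE)
print(label_counts)
```

Running the same kind of frequency table on the real column after cleaning is a quick way to spot labels that none of the patterns caught.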
The crop damage and property damage for each event are recorded as a number plus a multiplier code (“m” for million, “b” for billion etc). To convert them into absolute dollar amounts, I first create a function that turns a multiplier code (“m”, “b”, “3” etc) into a numeric exponent, and then multiply the damage number by 10 raised to that exponent. Empty or unrecognized codes are given an exponent of 0, which leaves the damage number unchanged.
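# Convert a multiplier code ("h", "k", "m", "b" or a single digit) into a power of ten
convExp <- function(e) {
  if (e == "b") return(9)
  else if (e == "m") return(6)
  else if (e == "k") return(3)
  else if (e == "h") return(2)
  else if (e %in% as.character(0:9)) return(as.numeric(e))
  else return(0)  # empty or unrecognized code: multiply by 10^0 = 1
}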
d.low$absPropDmg <- as.numeric(d.low$PROPDMG) * 10 ^ as.numeric(sapply(d.low$PROPDMGEXP, FUN=convExp))
d.low$absCrpDmg <- as.numeric(d.low$CROPDMG) * 10 ^ as.numeric(sapply(d.low$CROPDMGEXP, FUN=convExp))
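Calling a function once per row with `sapply` can be slow on a dataset of this size. As a possible alternative, the same conversion can be sketched with a named lookup vector that handles a whole column at once. The names `exp_lookup` and `conv_exp_vec` and the sample values are hypothetical and are not part of the original analysis; the mapping of codes to exponents follows the rules stated above.

```r
# Hypothetical named lookup vector: multiplier code -> power of ten
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

# Vectorized counterpart of convExp(): takes a whole character vector
conv_exp_vec <- function(codes) {
  codes <- tolower(as.character(codes))
  out <- unname(exp_lookup[codes])              # NA for codes not in the lookup
  is_digit <- grepl("^[0-9]$", codes)
  out[is_digit] <- as.numeric(codes[is_digit])  # digit codes are used as exponents
  out[is.na(out)] <- 0                          # "", "?", "+" etc. -> 10^0 = 1
  out
}

# Made-up damage numbers and multiplier codes
codes <- c("k", "M", "b", "", "3", "?")
dmg <- c(25, 2.5, 1, 10, 4, 7)
abs_dmg <- dmg * 10 ^ conv_exp_vec(codes)
print(abs_dmg)
```

Under these assumptions, `conv_exp_vec(d.low$PROPDMGEXP)` could replace the `sapply` call without changing the resulting exponents.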
In order to analyze economic impact of property damage and crop damage, I am going to sum these columns.
d.low$econDmg <- d.low$absPropDmg + d.low$absCrpDmg
The first task was to analyse and report which types of events are the most harmful with respect to population health. The NOAA storm database provides data on the number of fatalities and injuries caused by specific events. Let us analyze fatalities and injuries separately.
First, let us look at the top 10 event types by the number of fatalities.
fatal <- d.low %>%
  select(EVTYPE, FATALITIES, INJURIES, absPropDmg,
         absCrpDmg) %>%
  group_by(EVTYPE) %>%
  summarize(total.fatal = sum(FATALITIES)) %>%
  arrange(desc(total.fatal))
fatal$EVTYPE <- factor(fatal$EVTYPE, levels = unique(fatal$EVTYPE))
f <- ggplot(head(fatal, 10), aes(x = as.factor(EVTYPE), y = total.fatal)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Event type") +
  ylab("Total number of fatalities") +
  ggtitle("Top ten event types by the number of fatalities")
print(f)
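The group, sum and rank steps behind this plot can be cross-checked with base R on a small example. The sketch below uses a made-up data frame (`toy`, with invented values), so the numbers say nothing about the real dataset; it only illustrates the aggregation logic.

```r
# Toy data (hypothetical values, not from the NOAA file)
toy <- data.frame(
  EVTYPE = c("tornado", "flood", "tornado", "lightning", "flood", "tornado"),
  FATALITIES = c(5, 2, 3, 1, 4, 0)
)

# Sum fatalities per event type, then sort in decreasing order
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$FATALITIES), ]

# Keep the two event types with the most fatalities
top2 <- head(totals, 2)
print(top2)
```

Applying the same `aggregate` call to `d.low` and comparing the first rows with `head(fatal, 10)` is one way to confirm that the dplyr pipeline produces the expected ranking.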
Now let us analyze the top 10 event types by the number of injuries.
injury <- d.low %>%
  select(EVTYPE, FATALITIES, INJURIES, absPropDmg,
         absCrpDmg) %>%
  group_by(EVTYPE) %>%
  summarize(total.injury = sum(INJURIES)) %>%
  arrange(desc(total.injury))
injury$EVTYPE <- factor(injury$EVTYPE, levels = unique(injury$EVTYPE))
i <- ggplot(head(injury, 10), aes(x = as.factor(EVTYPE), y = total.injury)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Event type") +
  ylab("Total number of injuries") +
  ggtitle("Top ten event types by the number of injuries")
print(i)
The dataset contains information regarding property damage and crop damage. Let us use the sum of these two quantities as an indicator of economic consequences.
Now I am going to analyze the top 10 types of events by the combined property and crop damage.
econ <- d.low %>%
  select(EVTYPE, FATALITIES, INJURIES, econDmg) %>%
  group_by(EVTYPE) %>%
  summarize(total.econDmg = sum(econDmg)) %>%
  arrange(desc(total.econDmg))
econ$EVTYPE <- factor(econ$EVTYPE, levels = unique(econ$EVTYPE))
ggplot(head(econ, 10), aes(x = as.factor(EVTYPE), y = total.econDmg)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Event type") +
  ylab("Economic impact ($)") +
  ggtitle("Top ten event types by the amount of property damage and crop damage")