By Djoko Soehartono
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Data analysis will address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Findings:
Tornadoes cause the most fatalities.
Tornadoes cause the most injuries.
Flooding causes the most damage in dollar amount.
Setting the basics first:
echo = TRUE # Always make code visible
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Download the data set and load it into R.
if (!exists("data_storm")) {
  data_filename <- "./repdata-data-StormData.csv.bz2"
  if (!file.exists(data_filename)) {
    data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(url = data_url, destfile = data_filename)
  }
  data_storm <- read.csv(data_filename)
}
Before the analysis, the data need some preprocessing because event types are not recorded in a consistent format. For instance, there are events with types Frost/Freeze, FROST/FREEZE and FROST\\FREEZE, which obviously refer to the same type of event.
# number of unique event types
length(unique(data_storm$EVTYPE))
## [1] 985
# translate all event types to uppercase
event_types <- toupper(data_storm$EVTYPE)
# replace each punctuation or blank character with a space
event_types <- gsub("[[:blank:][:punct:]+]", " ", event_types)
# update the data frame
data_storm$EVTYPE <- event_types
length(unique(data_storm$EVTYPE))
## [1] 874
After the cleaning, as expected, the number of unique event types drops noticeably, from 985 to 874. The cleaned event types are used in the rest of the analysis, although further processing could merge event types such as TSTM WIND and THUNDERSTORM WIND.
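The merge mentioned above could be sketched roughly as follows. This is only an illustration on a small made-up vector, not part of the original analysis: the helper name `normalize_evtype` and the choice to expand the `TSTM` abbreviation are assumptions.

```r
# Hypothetical helper (not in the original analysis): clean an event type
# and expand the "TSTM" abbreviation so both spellings share one label.
normalize_evtype <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:blank:][:punct:]]", " ", x)      # same cleaning idea as above
  x <- gsub(" +", " ", x)                        # collapse runs of spaces
  x <- trimws(x)                                 # drop leading/trailing spaces
  gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
}

# Made-up sample values for illustration
sample_types <- c("TSTM WIND", "Thunderstorm Wind", " TSTM WIND/HAIL", "Frost/Freeze")
normalized <- normalize_evtype(sample_types)
print(normalized)
```

Applying a helper like this to `data_storm$EVTYPE` before grouping would change the number of unique event types and the totals per type, so the results reported here would need to be recomputed.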
In this section, we check the number of fatalities and injuries caused by severe weather events. We would like to find the five most harmful types of weather events.
# Select the casualty columns and total them per event type
casualties <- data_storm %>% select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarize(SUM_FATALITIES=sum(FATALITIES), SUM_INJURIES=sum(INJURIES))
# Determine top 5 events with largest number of fatalities
top_fatalities <- head(casualties[order(casualties$SUM_FATALITIES,
                                        decreasing = T), c("EVTYPE", "SUM_FATALITIES")], 5)
top_fatalities$EVTYPE <- factor(top_fatalities$EVTYPE, levels=top_fatalities$EVTYPE)
# Determine top 5 events with largest number of injuries
top_injuries <- head(casualties[order(casualties$SUM_INJURIES, decreasing = T),
                                c("EVTYPE", "SUM_INJURIES")], 5)
top_injuries$EVTYPE <- factor(top_injuries$EVTYPE, levels=top_injuries$EVTYPE)
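The ranking pattern used above (order, take the head, then fix the factor levels) is repeated for each measure, so it could be wrapped in a small helper. The sketch below uses base R and invented toy totals; the name `top_n_events` is hypothetical and does not appear in the original analysis.

```r
# Toy totals, invented for illustration only
totals <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  SUM_FATALITIES = c(10, 50, 5, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n rows with the largest value in `column`
top_n_events <- function(df, column, n = 5) {
  ranked <- df[order(df[[column]], decreasing = TRUE), c("EVTYPE", column)]
  top <- head(ranked, n)
  # Set the factor levels in ranked order so ggplot2 draws the bars in
  # that order instead of sorting the labels alphabetically
  top$EVTYPE <- factor(top$EVTYPE, levels = top$EVTYPE)
  top
}

top2 <- top_n_events(totals, "SUM_FATALITIES", n = 2)
print(top2)
```

Setting the factor levels explicitly is what keeps the bars sorted by total in the plots; without it, the x axis would follow alphabetical order.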
Next, compile the damage data. Property damage and crop damage are combined into a single damage amount. The data set stores each of them as an amount plus an exponent code (for example, K for thousands), so a function converts each pair of values into a dollar amount.
# Function to compute total damage amount
damage_amount <- function(amount, exponent)
{
  returnAmount <- 0
  if (toupper(exponent)[1]=="H") {
    returnAmount <- (amount * 100)}
  if (toupper(exponent)[1]=="K") {
    returnAmount <- (amount * 1000)}
  if (toupper(exponent)[1]=="M") {
    returnAmount <- (amount * 1000000)}
  if (toupper(exponent)[1]=="B") {
    returnAmount <- (amount * 1000000000)}
  if (toupper(exponent)[1] %in% c('', '-', '?', '+','0','1','2','3','4',
                                  '5','6','7','8','9')) {
    returnAmount <- amount }
  return(returnAmount)
}
data_damage <- data_storm %>% select(EVTYPE, PROPDMG, PROPDMGEXP,
                                     CROPDMG,CROPDMGEXP) %>%
  filter(PROPDMG > 0 | CROPDMG > 0)
data_damage$DMGAMT <- ((mapply(damage_amount, data_damage$PROPDMG,
                               data_damage$PROPDMGEXP)) +
                       (mapply(damage_amount, data_damage$CROPDMG,
                               data_damage$CROPDMGEXP)))
dmgamt_grouped <- data_damage %>% group_by(EVTYPE) %>%
  summarize(SUM_DMGAMT=sum(DMGAMT))
top_dmgamt <- head(dmgamt_grouped[order(dmgamt_grouped$SUM_DMGAMT,decreasing=T),
                                  c("EVTYPE", "SUM_DMGAMT")], 5)
top_dmgamt$EVTYPE <- factor(top_dmgamt$EVTYPE, levels=top_dmgamt$EVTYPE)
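Because `damage_amount` is called once per row through `mapply`, the conversion can be slow on a large data set. A vectorized alternative based on a lookup table might look like the sketch below. The name `damage_amount_vec` is hypothetical, and there is one behavioral difference to note: this sketch treats every code outside H, K, M and B as a multiplier of 1, whereas the function above returns 0 for any code it does not list.

```r
# Hypothetical vectorized variant of damage_amount() (an assumption,
# not the original author's code)
damage_amount_vec <- function(amount, exponent) {
  multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(exponent))
  mult <- unname(multipliers[code])   # NA for codes not in the table
  mult[is.na(mult)] <- 1              # '', '-', '?', '+', digits: amount as-is
  amount * mult
}

# Made-up inputs for illustration
converted <- damage_amount_vec(c(25, 2.5, 10, 3), c("K", "m", "", "5"))
print(converted)
```

With such a function, the two `mapply` calls could be replaced by direct calls on the whole columns, for example `damage_amount_vec(data_damage$PROPDMG, data_damage$PROPDMGEXP)`.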
Plot fatalities
ggplot(top_fatalities, aes(x=EVTYPE, y=SUM_FATALITIES)) +
  geom_bar(stat="identity", fill = "#222288") +
  xlab("Types") +
  ylab("Fatalities") +
  ggtitle("Total Fatalities By Weather Events")
Tornadoes cause the most fatalities, with more than 5,000 deaths in the last 60 years in the US. Excessive heat and flash floods rank second and third.
Plot injuries
ggplot(top_injuries, aes(x=EVTYPE, y=SUM_INJURIES)) +
  geom_bar(stat="identity", fill = "#222288") +
  xlab("Types") +
  ylab("Injuries") +
  ggtitle("Total Injuries By Weather Events")
Tornadoes also cause the most injuries, with more than 90,000 recorded in the last 60 years, far above thunderstorm winds and floods in second and third place respectively. Note that, due to the untidy nature of the data, TSTM WIND and THUNDERSTORM WIND should be merged for a more accurate analysis.
Plot total damage amount in dollars
ggplot(top_dmgamt, aes(x=EVTYPE, y=SUM_DMGAMT)) +
  geom_bar(stat="identity", fill = "#882222") +
  xlab("Types") +
  ylab("Damage Amount (In Dollars)") +
  ggtitle("Total Damage Amount By Weather Events")
Floods cause the most damage in the last 60 years, with total damage of more than USD 150 billion, followed by hurricanes/typhoons and tornadoes. Again, due to the untidy nature of the data, FLOOD and FLASH FLOOD should be merged for a more accurate analysis.
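One way to merge related flood types is to map each event type to a broader category and sum the totals again. The sketch below uses invented numbers rather than values from the storm database, and both the rule (any type containing the word FLOOD) and the label `FLOOD (ALL)` are assumptions made for illustration.

```r
# Invented per-type totals, for illustration only
damage <- data.frame(
  EVTYPE = c("FLOOD", "FLASH FLOOD", "TORNADO", "RIVER FLOOD"),
  SUM_DMGAMT = c(100, 20, 50, 5),
  stringsAsFactors = FALSE
)

# Hypothetical rule: any type containing "FLOOD" joins one category
damage$CATEGORY <- ifelse(grepl("FLOOD", damage$EVTYPE, fixed = TRUE),
                          "FLOOD (ALL)", damage$EVTYPE)

# Re-total by category and rank from largest to smallest
merged <- aggregate(SUM_DMGAMT ~ CATEGORY, data = damage, FUN = sum)
merged <- merged[order(merged$SUM_DMGAMT, decreasing = TRUE), ]
print(merged)
```

A rule this broad would also absorb types that mention flooding alongside other hazards, so the mapping would need to be reviewed against the actual list of event types before drawing conclusions from it.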