Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This analysis identifies which types of weather events are most harmful to public health and to the economy. Using the last 30 years of the dataset (1981-2011), we found that tornadoes are the most harmful events for public health: they rank first in fatalities and first by a wide margin in injuries. From the economic perspective, floods were the most damaging, mostly because of property losses.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
First, we load the required packages.
library(ggplot2)
library(plyr)
library(reshape2)
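To begin our analysis, we download the compressed data and read it into R.
## setInternet2(TRUE) is omitted: it is Windows-only and deprecated in recent versions of R
## Set up the data directory
if (!file.exists("./data")) {
  dir.create("./data")
}
## Local path for the compressed file
f <- "./data/StormData.csv.bz2"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## Download the data (skipped if the file is already present)
if (!file.exists(f)) {
  download.file(url, f, mode = "wb")
}
## Import the data into R; bzfile() decompresses on the fly
data <- read.csv(bzfile(f), stringsAsFactors = FALSE)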
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. Because the older measurements may also be prone to larger errors, we chose to include only the last thirty years of data in our analysis.
# Set format
data$BGN_DATE <- as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
# Restrict to the last 30 years
data30 <- data[data$BGN_DATE >= as.Date("1981-01-01"), ]
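One way to check the claim about sparse early records is to count events per year once the dates are parsed. The sketch below is only an illustration: the date strings are made-up examples in the same format as BGN_DATE (they are not taken from the dataset), and `dates_raw`, `dates`, `per_year`, and `recent` are hypothetical names.

```r
# Illustrative date strings in the BGN_DATE format (not real records)
dates_raw <- c("4/18/1950 0:00:00", "6/1/1981 0:00:00",
               "6/2/1981 0:00:00", "11/28/2011 0:00:00")

# Parse with the same format string used above
dates <- as.Date(dates_raw, format = "%m/%d/%Y %H:%M:%S")

# Count records per calendar year
per_year <- table(format(dates, "%Y"))

# Keep only records from 1981 onwards, as in the analysis
recent <- dates[dates >= as.Date("1981-01-01")]
```

Applied to the full data, a table like `per_year` would show how the number of recorded events changes over time.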
Processing the categories data
Note that the event categories are not clean: some are misspelled and some are duplicated.
As a first step, we converted all event names to upper case.
We then tried to simplify the data by mapping events onto 9 standard categories, using a 14x2 matrix that pairs the most significant keywords with their standard category.
In particular:
- SNOW category gathers events containing SNOW and BLIZZARD
- WIND category gathers events containing WIND, TSTM WIND, STORM
- WARM category gathers events containing HEAT and WARM
- OTHER category gathers events not covered by the previous categories
# Clean Event Type...
data30$EVTYPE <- toupper(data30$EVTYPE)
Set the standard names and the loop indices
#Set name of event
stdET <- matrix(c(c("TORNADO", "HAIL", "TSTM WIND", "RAIN", "FIRE", "FLOOD","SNOW",
"COLD", "WARM", "WIND", "BLIZZARD","STORM", "WINTER","HEAT"),c("TORNADO", "HAIL", "WIND",
"RAIN", "FIRE", "FLOOD","SNOW","COLD", "WARM", "WIND", "SNOW","WIND","COLD","WARM")),
nrow=14,ncol=2)
#Set index of loop
n30 <- nrow(data30) # number of rows in data30
t <- dim(stdET)[1] # length of stdET standard
NoT <- "OTHER" # no stdET word match in EVTYPE
The following routine replaces each raw event type with its standard category
## Replace each raw event type with its standard category (last 30 years)
for (i in seq_len(n30)) {
  flag <- FALSE
  for (j in seq_len(t)) {
    if (grepl(stdET[j, 1], data30$EVTYPE[i])) {
      data30$EVTYPE[i] <- stdET[j, 2]
      flag <- TRUE
    }
  }
  if (!flag) {
    data30$EVTYPE[i] <- NoT
  }
}
This method reduces the number of event types but takes an excessive amount of time to run, so it was not used for the results presented here.
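The row-by-row loop is slow mainly because it calls grepl once per row and keyword. A possible vectorized alternative is sketched below. It is not the routine used in this report: `standardize_evtype`, `keywords`, and `std_labels` are hypothetical names, matching is done on fixed substrings, and the first matching keyword wins, which may differ from the loop above in edge cases.

```r
# Keywords and their standard categories, in the same order as stdET
keywords   <- c("TORNADO", "HAIL", "TSTM WIND", "RAIN", "FIRE", "FLOOD", "SNOW",
                "COLD", "WARM", "WIND", "BLIZZARD", "STORM", "WINTER", "HEAT")
std_labels <- c("TORNADO", "HAIL", "WIND", "RAIN", "FIRE", "FLOOD", "SNOW",
                "COLD", "WARM", "WIND", "SNOW", "WIND", "COLD", "WARM")

# Map a character vector of event types to standard categories
standardize_evtype <- function(evtype, keywords, std_labels, other = "OTHER") {
  evtype <- toupper(evtype)
  out <- rep(other, length(evtype))
  # Go through the keywords in reverse so that earlier keywords
  # overwrite later ones, i.e. the first matching keyword wins
  for (j in rev(seq_along(keywords))) {
    hit <- grepl(keywords[j], evtype, fixed = TRUE)
    out[hit] <- std_labels[j]
  }
  out
}

# Small usage example on a handful of event names
example_types <- standardize_evtype(
  c("tstm wind", "Blizzard", "EXCESSIVE HEAT", "LIGHTNING"),
  keywords, std_labels
)
```

Each keyword is matched against the whole column in one call, so the work is 14 vectorized passes instead of one pass per row.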
Processing the public health damage data
The public health damage data need to be summarized to show the number of injuries and fatalities by event name. Next, the top 10 events resulting in injuries and the top 10 events resulting in fatalities are selected. Finally, the data is melted with the reshape2 package so that we can use it later in ggplot charts.
# Make sums of injuries and fatalities
sumhealth <- ddply(data30, .(EVTYPE), summarise, fatalities = sum(FATALITIES),
injuries = sum(INJURIES))
## Select ten most harmful events
topfatalities <- head(sumhealth[order(sumhealth$fatalities, decreasing = TRUE),
], n = 10)[, c(1, 2)]
topinjuries <- head(sumhealth[order(sumhealth$injuries, decreasing = TRUE), ],
n = 10)[, c(1, 3)]
## Prepare data for the barchart
forchart1 <- melt(topfatalities)
## Using EVTYPE as id variables
forchart2 <- melt(topinjuries)
## Using EVTYPE as id variables
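To make the summarise-then-rank step easier to follow, here is a minimal sketch of the same idea in base R on a toy data frame. The numbers are made up for illustration only, and `toy`, `toy_health`, and `toy_top` are hypothetical names; the analysis itself uses ddply as shown above.

```r
# Toy records standing in for data30 (values are invented for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0),
  INJURIES   = c(10, 2, 5, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Sum both outcome columns per event type
toy_health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

# Rank by fatalities and keep the two worst event types
toy_top <- head(toy_health[order(toy_health$FATALITIES, decreasing = TRUE), ],
                n = 2)
```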
Processing the economic damage data
The economic damage data is recorded as a base value and a multiplier (encoded as an abbreviation such as K, M, or B). We therefore multiply each base value by its multiplier.
# Property damage multiplier: prepare and use to multiply the damage
data30$PROPDMGEXP[is.na(data30$PROPDMGEXP)] <- 0
data30$PROPDMGEXP[data30$PROPDMGEXP == ""] <- 1
data30$PROPDMGEXP[grep("[-+?]", data30$PROPDMGEXP)] <- 1
data30$PROPDMGEXP[grep("[Hh]", data30$PROPDMGEXP)] <- 100
data30$PROPDMGEXP[grep("[Kk]", data30$PROPDMGEXP)] <- 1000
data30$PROPDMGEXP[grep("[Mm]", data30$PROPDMGEXP)] <- 1e+06
data30$PROPDMGEXP[grep("[Bb]", data30$PROPDMGEXP)] <- 1e+09
data30$PROPDMGEXP <- as.numeric(data30$PROPDMGEXP)
data30$PROPDMG <- data30$PROPDMGEXP * data30$PROPDMG
# Crop damage multiplier: prepare and use to multiply the damage
data30$CROPDMGEXP[is.na(data30$CROPDMGEXP)] <- 0
data30$CROPDMGEXP[data30$CROPDMGEXP == ""] <- 1
data30$CROPDMGEXP[grep("[-+?]", data30$CROPDMGEXP)] <- 1
data30$CROPDMGEXP[grep("[Hh]", data30$CROPDMGEXP)] <- 100
data30$CROPDMGEXP[grep("[Kk]", data30$CROPDMGEXP)] <- 1000
data30$CROPDMGEXP[grep("[Mm]", data30$CROPDMGEXP)] <- 1e+06
data30$CROPDMGEXP[grep("[Bb]", data30$CROPDMGEXP)] <- 1e+09
data30$CROPDMGEXP <- as.numeric(data30$CROPDMGEXP)
data30$CROPDMG <- data30$CROPDMGEXP * data30$CROPDMG
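Since the same recoding is applied to both the property and the crop multiplier, it could be factored into one helper. The sketch below is an assumption-laden illustration rather than the code used here: `exp_to_multiplier` is a hypothetical name, letters are matched exactly (not as substrings), and digit codes are passed through as plain numbers, which mirrors what the code above does but is not necessarily the right reading of the NOAA codes.

```r
# Convert multiplier codes (e.g. "K", "M", "B") to numeric multipliers
exp_to_multiplier <- function(code) {
  code <- as.character(code)
  missing_code <- is.na(code)
  code[missing_code] <- ""
  code <- toupper(trimws(code))

  # Letter codes and their multipliers
  lookup <- c(H = 100, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[code])            # NA for anything not in the lookup

  # Empty codes and the symbols -, +, ? count as a multiplier of 1
  mult[code %in% c("", "-", "+", "?")] <- 1

  # Digit codes are passed through as numbers (assumption, as in the code above)
  digit <- grepl("^[0-9]+$", code)
  mult[digit] <- as.numeric(code[digit])

  # Missing codes get a multiplier of 0, as in the code above
  mult[missing_code] <- 0
  mult
}

# Usage example on a few codes
example_mult <- exp_to_multiplier(c("K", "m", "", "?", "B", "h", NA, "5"))
```

With such a helper, each damage column could be computed as the base value times `exp_to_multiplier()` of its code column.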
As with the health data, the economic damage figures are first summarized by type of event. The top 10 events with the highest economic impact (defined as damage to crops plus damage to property) are then selected.
Finally, the data is prepared for ggplot with the melt function.
# Sum crop and property damage by event type
sumecon <- ddply(data30, .(EVTYPE), summarise, cropdmg = sum(CROPDMG), propdmg = sum(PROPDMG))
sumecon$totaldamage <- sumecon$cropdmg + sumecon$propdmg
## Select top 10
topecon <- head(sumecon[order(sumecon$totaldamage, decreasing = TRUE), ], n = 10)
## Prepare data for the barchart
forchart3 <- melt(topecon)
## Using EVTYPE as id variables
forchart3 <- forchart3[forchart3$variable != "totaldamage", ]
Question 1: Public health
The following table and chart present the 10 most damaging events from the perspective of fatalities.
##             EVTYPE fatalities
## 758        TORNADO       2246
## 116 EXCESSIVE HEAT       1903
## 138    FLASH FLOOD        978
## 243           HEAT        937
## 418      LIGHTNING        816
## 779      TSTM WIND        504
## 154          FLOOD        470
## 524    RIP CURRENT        368
## 320      HIGH WIND        248
## 19       AVALANCHE        224
## Make the barchart
ggplot(forchart1, aes(x = factor(EVTYPE), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(face = "bold"),
        legend.position = "none") +
  labs(x = "Weather event", y = "Number of fatalities") +
  ggtitle("Fatalities")
The event with the highest number of fatalities during the last 30 years of the dataset was TORNADO followed by EXCESSIVE HEAT.
The following table and chart present the 10 most damaging events from the perspective of injuries.
##                EVTYPE injuries
## 758           TORNADO    36814
## 779         TSTM WIND     6957
## 154             FLOOD     6789
## 116    EXCESSIVE HEAT     6525
## 418         LIGHTNING     5230
## 243              HEAT     2100
## 387         ICE STORM     1975
## 138       FLASH FLOOD     1777
## 685 THUNDERSTORM WIND     1488
## 212              HAIL     1361
## Make the barchart
ggplot(forchart2, aes(x = factor(EVTYPE), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(face = "bold"),
        legend.position = "none") +
  labs(x = "Weather event", y = "Number of injuries") +
  ggtitle("Injuries")
The event with the highest number of injuries during the last 30 years of the dataset was TORNADO.
To sum up, TORNADO seems to be the most damaging event from the perspective of public health.
Question 2: Economic damage
The following table and chart present the 10 most damaging events from the perspective of economic damage.
##                EVTYPE     cropdmg      propdmg  totaldamage
## 154             FLOOD  5661968450 144657709807 150319678257
## 372 HURRICANE/TYPHOON  2607872800  69305840000  71913712800
## 599       STORM SURGE        5000  43323536000  43323541000
## 758           TORNADO   414953110  41485762374  41900715484
## 212              HAIL  3025954453  15732267427  18758221880
## 138       FLASH FLOOD  1421317100  16140812294  17562129394
## 84            DROUGHT 13972566000   1046106000  15018672000
## 363         HURRICANE  2741910000  11868319010  14610229010
## 529       RIVER FLOOD  5029459000   5118945500  10148404500
## 387         ICE STORM  5022113500   3944927810   8967041310
## Make the barchart
ggplot(forchart3, aes(x = factor(EVTYPE), y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(face = "bold"),
        legend.position = "top") +
  labs(x = "Weather event", y = "Economic damage (USD)") +
  scale_fill_discrete(name = "Type of damage", labels = c("Crop", "Property")) +
  ggtitle("Economic impact")
The event with the highest economic damage during the last 30 years of the dataset was FLOOD.