We analysed weather-event data from the NOAA storm database to determine which event types have the greatest effect on human health and which cause the most economic damage.
(Excessive) heat and tornadoes caused the most casualties between 1996 and 2011. Tornadoes caused by far the most injuries.
Floods and hurricanes caused the most economic damage between 1996 and 2011.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This document involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The raw data used for the analysis can be found using this link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Documentation of the database is also available. It explains how some of the variables are constructed and defined.
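The events in the database start in the year 1950 and end in November 2011. Not all years have data on all types of storm events (source: https://www.ncdc.noaa.gov/stormevents/details.jsp). According to that source, all 48 event types are recorded from 1996 onward, and the analysis below uses only data from 1996 on.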
The goal of the analysis is to answer the following questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Download and decompress data.
## Downloading the original data
## Check if the data-directory exists. If not create it.
if (!file.exists("./data")) {
dir.create("./data")
}
fileUrl <-
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## Download the compressed file unless it is already in the data-directory
if (!file.exists("./data/StormData.csv.bz2")) {
  ## binary mode keeps the bzip2 archive intact on all platforms
  download.file(fileUrl, destfile = "./data/StormData.csv.bz2", mode = "wb")
}
## Decompress the file unless the decompressed copy already exists
if (!file.exists("./data/StormData.csv")) {
  library(R.utils)
  bunzip2(
    "./data/StormData.csv.bz2",
    destname = "./data/StormData.csv",
    remove = FALSE,
    skip = TRUE
  )
}
Loading the data for analysis.
## read the data into a dataframe
RawStormData <- read.csv("./data/StormData.csv")
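As a side note, the decompression step is optional: read.csv() opens a file path through file(), which detects bzip2 compression by itself. The following minimal sketch shows this with a small temporary file; the file contents and the name Sample are made up for illustration and are not part of the storm data.

```r
## Write a tiny bzip2-compressed CSV to a temporary file (made-up contents)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "FLOOD,1"), con)
close(con)

## read.csv() reads the compressed file directly, no bunzip2() needed
Sample <- read.csv(tmp)
print(Sample)
```

Reading the .bz2 file directly saves disk space, at the cost of decompressing the file on every read.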
Extract only the data needed for analysis.
The following variables were selected:
EVTYPE - Event type
FATALITIES - Number of fatalities
INJURIES - Number of injuries
PROPDMG - Damage to properties
PROPDMGEXP - Unit size of damage to properties
CROPDMG - Damage to crops
CROPDMGEXP - Unit size of damage to crops
## load lubridate package
library(lubridate)
## transform BGN_DATE from a Factor variable to a Date variable
RawStormData$BGN_DATE <- mdy_hms(RawStormData$BGN_DATE)
## select only data for year greater or equal to 1996
TransformStormData <-
RawStormData[which(year(RawStormData$BGN_DATE) >= 1996),]
## select only data with fatalities, injuries, damage to crops, damage to properties
TransformStormData <-
TransformStormData[which(
TransformStormData$FATALITIES + TransformStormData$INJURIES + TransformStormData$CROPDMG +
TransformStormData$PROPDMG > 0
), ]
## keep only the relevant variables
myvars <-
c(
"EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP"
)
TransformStormData <- TransformStormData[myvars]
After this step there are 201,318 observations in 7 variables in the data.
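The two row selections (year from 1996 on, and at least one non-zero impact value) can be checked on a few rows first. The sketch below uses base R instead of lubridate and a made-up data frame named Toy; it assumes dates in the same "month/day/year time" text format as BGN_DATE.

```r
## Toy rows in the same date format as BGN_DATE (values are made up)
Toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00"),
  FATALITIES = c(0, 1, 0),
  INJURIES = c(0, 0, 0),
  PROPDMG = c(0, 0, 0),
  CROPDMG = c(2.5, 0, 0)
)

## as.Date() parses the date part; the trailing time text is ignored
Toy$YEAR <- as.integer(format(as.Date(Toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

## keep rows from 1996 on that have any fatalities, injuries or damage
HasImpact <- with(Toy, FATALITIES + INJURIES + PROPDMG + CROPDMG > 0)
ToySelected <- Toy[Toy$YEAR >= 1996 & HasImpact, ]
print(ToySelected)
```

Only the 1996 row remains: the 1950 row is too early and the 2011 row has no recorded impact.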
Calculate total damage amounts
CROPDMGEXP and PROPDMGEXP hold unit codes: letters (H, K, M, B, in upper or lower case), digits, the symbols +, - and ?, or an empty value. Each code is mapped to a multiplier.
To make the calculations we have added 3 variables:
PROPERTYDAMAGE
CROPDAMAGE
TOTALDAMAGE (= PROPERTYDAMAGE + CROPDAMAGE)
And removed the variables:
CROPDMGEXP
CROPDMG
PROPDMGEXP
PROPDMG
## add variables with values instead of letters for units
TransformStormData$PROPEXP[toupper(TransformStormData$PROPDMGEXP) == "K"] <-
1000
TransformStormData$PROPEXP[toupper(TransformStormData$PROPDMGEXP) == "M"] <-
1000000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == ""] <- 1
TransformStormData$PROPEXP[toupper(TransformStormData$PROPDMGEXP) == "B"] <-
1000000000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "0"] <-
1
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "5"] <-
100000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "6"] <-
1000000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "4"] <-
10000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "2"] <-
100
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "3"] <-
1000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "7"] <-
10000000
TransformStormData$PROPEXP[toupper(TransformStormData$PROPDMGEXP) == "H"] <-
100
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "1"] <-
10
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "8"] <-
100000000
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "+"] <-
0
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "-"] <-
0
TransformStormData$PROPEXP[TransformStormData$PROPDMGEXP == "?"] <-
0
TransformStormData$CROPEXP[toupper(TransformStormData$CROPDMGEXP) == "M"] <-
1000000
TransformStormData$CROPEXP[toupper(TransformStormData$CROPDMGEXP) == "K"] <-
1000
TransformStormData$CROPEXP[toupper(TransformStormData$CROPDMGEXP) == "B"] <-
1000000000
TransformStormData$CROPEXP[TransformStormData$CROPDMGEXP == "0"] <-
1
TransformStormData$CROPEXP[TransformStormData$CROPDMGEXP == "2"] <-
100
TransformStormData$CROPEXP[TransformStormData$CROPDMGEXP == ""] <- 1
TransformStormData$CROPEXP[TransformStormData$CROPDMGEXP == "?"] <-
0
## multiply to get damage in dollars
TransformStormData$PROPERTYDAMAGE <-
TransformStormData$PROPDMG * TransformStormData$PROPEXP
TransformStormData$CROPDAMAGE <-
TransformStormData$CROPDMG * TransformStormData$CROPEXP
TransformStormData$TOTALDAMAGE <-
TransformStormData$PROPERTYDAMAGE + TransformStormData$CROPDAMAGE
## remove variables CROPEXP, CROPDMGEXP, CROPDMG, PROPEXP, PROPDMGEXP, PROPDMG, PROPERTYDAMAGE, CROPDAMAGE
myvars <-
names(TransformStormData) %in% c(
"CROPEXP",
"CROPDMGEXP",
"CROPDMG",
"PROPEXP",
"PROPDMGEXP",
"PROPDMG",
"PROPERTYDAMAGE",
"CROPDAMAGE"
)
TransformStormData <- TransformStormData[!myvars]
This has reduced the data to 201,318 observations in 4 variables.
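The mapping from unit codes to multipliers can also be written as a single lookup table. This is a sketch of an alternative, not the code used for the results above; the names ExpValue and ToMultiplier are hypothetical, and the multiplier values simply repeat the assignments made earlier.

```r
## Multipliers keyed by unit code; values follow the assignments above
ExpValue <- c(
  H = 1e2, K = 1e3, M = 1e6, B = 1e9,
  setNames(10^(0:8), as.character(0:8)),
  "+" = 0, "-" = 0, "?" = 0
)

## Translate a vector of unit codes into numeric multipliers
ToMultiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- unname(ExpValue[code])  # codes not in the table become NA
  mult[code == ""] <- 1           # an empty code means "no scaling"
  mult
}

Codes <- c("K", "m", "", "B", "5")
Multipliers <- ToMultiplier(Codes)
print(Multipliers)
```

A code that is not in the table yields NA instead of being silently skipped, which makes unexpected codes easy to spot with anyNA().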
Event types
According to the documentation, 48 event types are defined. However, there are 985 unique values in the dataset. After removing all numeric values and parentheses, removing leading, trailing and double spaces, and making all values upper case, 181 different values still remain.
## make all EVTYPE upper case
TransformStormData$EVTYPE <- toupper(TransformStormData$EVTYPE)
## remove all numeric values from strings (with leading G)
TransformStormData$EVTYPE <-
gsub("G[[:digit:]]+", "", TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
gsub("[[:digit:]]+", "", TransformStormData$EVTYPE)
## remove parentheses from strings
TransformStormData$EVTYPE <-
gsub("\\(", "", TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
gsub("\\)", "", TransformStormData$EVTYPE)
## remove double, leading and trailing spaces
TransformStormData$EVTYPE <-
gsub(pattern = "\\s+",
replacement = " ",
TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
sub("^\\s+", "", TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
sub("\\s+$", "", TransformStormData$EVTYPE)
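The cleaning steps above can be collected in one function, which makes them easy to try on a few values. This is a minimal sketch; the function name NormalizeEventType and the sample strings are made up for illustration.

```r
## Apply the same cleaning steps as above to a character vector
NormalizeEventType <- function(x) {
  x <- toupper(x)
  x <- gsub("G[[:digit:]]+", "", x)  # numbers with a leading G, e.g. "G45"
  x <- gsub("[[:digit:]]+", "", x)   # remaining numbers
  x <- gsub("[()]", "", x)           # parentheses
  x <- gsub("\\s+", " ", x)          # runs of whitespace become one space
  trimws(x)                          # leading and trailing spaces
}

Cleaned <- NormalizeEventType(c("  Tstm Wind (G45)", "High  Winds 80", "tornado"))
print(Cleaned)
```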
There are some common typos and spelling variants that can be handled:
## replace TSTM by THUNDERSTORM
TransformStormData$EVTYPE <-
gsub(pattern = "TSTM",
replacement = "THUNDERSTORM",
TransformStormData$EVTYPE)
## replace WINDS by WIND
TransformStormData$EVTYPE <-
gsub(pattern = "WINDS",
replacement = "WIND",
TransformStormData$EVTYPE)
## replace CURRENTS by CURRENT
TransformStormData$EVTYPE <-
gsub(pattern = "CURRENTS",
replacement = "CURRENT",
TransformStormData$EVTYPE)
## replace all "/" with " "
TransformStormData$EVTYPE <-
gsub(pattern = "/",
replacement = " ",
TransformStormData$EVTYPE)
## replace all strings with "COLD" or "WIND CHILL" with "COLD WIND CHILL"
TransformStormData$EVTYPE <-
gsub(pattern = ".*COLD.*",
replacement = "COLD WIND CHILL",
TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
gsub(pattern = ".*WIND CHILL.*",
replacement = "COLD WIND CHILL",
TransformStormData$EVTYPE)
## replace "FLOOD FLASH FLOOD" and "FLASH FLOOD FLOOD" with "FLASH FLOOD"
TransformStormData$EVTYPE <-
gsub(pattern = "FLOOD FLASH FLOOD",
replacement = "FLASH FLOOD",
TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
gsub(pattern = "FLASH FLOOD FLOOD",
replacement = "FLASH FLOOD",
TransformStormData$EVTYPE)
There is no time to recode every single event type. So we will focus on the event types with the highest impact.
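## Replace "EXTREME COLD" with "EXTREME COLD WIND CHILL"
## (no effect at this point: every string containing "COLD" was already recoded to "COLD WIND CHILL" above)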
TransformStormData$EVTYPE <-
gsub(pattern = "EXTREME COLD",
replacement = "EXTREME COLD WIND CHILL",
TransformStormData$EVTYPE)
## Replace strings containing "HURRICANE" with "HURRICANE"
TransformStormData$EVTYPE <-
gsub(pattern = ".*HURRICANE.*",
replacement = "HURRICANE",
TransformStormData$EVTYPE)
## Replace "HEAVY SURF HIGH SURF", "HEAVY SURF AND WIND" and "HEAVY SURF" with "HIGH SURF"
## (longer patterns first, otherwise the bare "HEAVY SURF" rule rewrites them before they can match)
TransformStormData$EVTYPE <-
  gsub(pattern = "HEAVY SURF HIGH SURF",
       replacement = "HIGH SURF",
       TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
  gsub(pattern = "HEAVY SURF AND WIND",
       replacement = "HIGH SURF",
       TransformStormData$EVTYPE)
TransformStormData$EVTYPE <-
  gsub(pattern = "HEAVY SURF",
       replacement = "HIGH SURF",
       TransformStormData$EVTYPE)
## Replace "LANDSLIDE*" with "DEBRIS FLOW"
TransformStormData$EVTYPE <-
gsub(pattern = "LANDSLIDE.*",
replacement = "DEBRIS FLOW",
TransformStormData$EVTYPE)
## Replace "WILD FOREST FIRE" with "WILDFIRE"
TransformStormData$EVTYPE <-
gsub(pattern = "WILD FOREST FIRE",
replacement = "WILDFIRE",
TransformStormData$EVTYPE)
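## Replace "STORM SURGE" with "STORM SURGE TIDE"
## (anchored, so values that already read "STORM SURGE TIDE" do not become "STORM SURGE TIDE TIDE")
TransformStormData$EVTYPE <-
  gsub(pattern = "^STORM SURGE$",
       replacement = "STORM SURGE TIDE",
       TransformStormData$EVTYPE)
## Replace "FOG" with "DENSE FOG"
## (anchored, so "DENSE FOG" and "FREEZING FOG" are left unchanged)
TransformStormData$EVTYPE <-
  gsub(pattern = "^FOG$",
       replacement = "DENSE FOG",
       TransformStormData$EVTYPE)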
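Recoding rules like these can also be kept in one named vector and applied in order, which makes the order of the rules explicit. The sketch below is an illustration only: Rules and ApplyRules are hypothetical names, and the rule set is a small subset. Patterns wrapped in ^ and $ match whole values only, so a value that already contains the replacement text is left alone.

```r
## Pattern (name) and replacement (value); applied from top to bottom
Rules <- c(
  "TSTM" = "THUNDERSTORM",
  "WINDS" = "WIND",
  "^FOG$" = "DENSE FOG",
  "^STORM SURGE$" = "STORM SURGE TIDE"
)

## Apply every rule in turn to a character vector
ApplyRules <- function(x, rules) {
  for (i in seq_along(rules)) {
    x <- gsub(names(rules)[i], rules[[i]], x)
  }
  x
}

Recoded <- ApplyRules(c("TSTM WINDS", "FOG", "DENSE FOG", "STORM SURGE TIDE"), Rules)
print(Recoded)
```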
To get the results we first need to aggregate the data on the event type
## load dplyr package
library(dplyr)
StormData <-
  ## pass sum directly; the funs() wrapper is deprecated in current dplyr versions
  TransformStormData %>% group_by(EVTYPE) %>% summarize_all(sum)
## make EVTYPE a factor variable
StormData$EVTYPE <- as.factor(StormData$EVTYPE)
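For readers without dplyr, the same per-event totals can be sketched in base R with aggregate(). The data frame Toy below is made up for illustration; it only mimics the four columns that remain after the cleaning steps.

```r
## Toy data with the four remaining columns (values are made up)
Toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO"),
  FATALITIES = c(2, 1, 3),
  INJURIES = c(10, 0, 5),
  TOTALDAMAGE = c(1e6, 5e6, 2e6)
)

## Sum every other column within each event type
ToyTotals <- aggregate(. ~ EVTYPE, data = Toy, FUN = sum)
print(ToyTotals)
```

One difference to keep in mind: the formula interface of aggregate() drops rows that contain NA by default, whereas sum() inside dplyr would return NA for the affected group.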
First we look at the event types that caused the most fatalities and injuries. The bar charts below show the top 10 event types for each.
## order both fatalities and injuries in descending order
FatalitiesOrdered <-
StormData[order(StormData$FATALITIES, decreasing = TRUE), ]
InjuriesOrdered <-
StormData[order(StormData$INJURIES, decreasing = TRUE), ]
## now produce a top 10 barchart for both fatalities and injuries
par(mar = c(12, 5, 1 , 1))
barplot(
FatalitiesOrdered$FATALITIES[1:10] / 100,
names.arg = FatalitiesOrdered$EVTYPE[1:10],
las = 2,
main = "Top 10 events with most fatalities",
ylab = "number of fatalities x 100"
)
barplot(
InjuriesOrdered$INJURIES[1:10] / 100,
names.arg = InjuriesOrdered$EVTYPE[1:10],
las = 2,
main = "Top 10 events with most injuries",
ylab = "number of injuries x 100"
)
It is clear that tornadoes are very harmful: they cause many fatalities and injuries. Excessive heat also causes many fatalities. These two event types therefore cause the most harm to human health.
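The ranking behind such a chart can also be printed as a small table, which is useful for checking the exact values. The helper below is a sketch; TopEvents is a hypothetical name and the toy data is made up.

```r
## Return the n event types with the highest value in the given column
TopEvents <- function(data, column, n = 10) {
  Ordered <- data[order(data[[column]], decreasing = TRUE), ]
  head(Ordered[, c("EVTYPE", column)], n)
}

Toy <- data.frame(
  EVTYPE = c("A", "B", "C"),
  FATALITIES = c(5, 20, 1)
)
ToyTop <- TopEvents(Toy, "FATALITIES", n = 2)
print(ToyTop)
```

With the aggregated data, a call such as TopEvents(StormData, "INJURIES") would list the same ten event types that the injuries chart shows.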
Now let’s take a look at the events with the most economic damage.
## order total damage in descending order
TotalDamageOrdered <-
  StormData[order(StormData$TOTALDAMAGE, decreasing = TRUE), ]
## now produce a top 10 barchart for total economic damage
par(mar = c(12, 5, 1 , 1))
barplot(
TotalDamageOrdered$TOTALDAMAGE[1:10] / 1000000000,
names.arg = TotalDamageOrdered$EVTYPE[1:10],
las = 2,
main = "Top 10 events with most economic damage",
ylab = "economic damage in billion dollars"
)
Floods and hurricanes cause the most economic damage.
The two points of view (health versus economic damage) lead to different answers about which weather events are most harmful.