Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern for authorities. This study seeks to assist in this effort by identifying the weather events that are most harmful to public health and the economy.
Using data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, I show that Tornados and Lightening are by far the most responsible weather events for injuries and are also the identifiable events most responsible for fatalities in the United States. I also find that floods, tornados, and hail are most responsible for property and crop damages in the United States over the same period.
Anyone who seeks to reproduce this project can find all the code and data on my GitHub Repository.
The data used for this analysis comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. It comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
Below, I will check for the availability of the data object in my work space, unzip and load the file, and cache the procedure to facilitate any future runs.
# Load Knitr library
library(knitr)
# Remove comment symbol from all results
opts_chunk$set(comment = NA)
# Check for data and load
if(!exists("noaa.data")) {
noaa.data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header = TRUE)
}
# Check dimentions of Dataframe
dim(noaa.data)
## [1] 902297 37
As shown above, the data contains 902,297 observations and 37 variables. For this analysis, I am interested only in those variables related to weather events, public health, and the economy. These variables are: EVTYPE (weather event); FATALITIES (approximate number of deaths); INJURIES (approximate number of injuries); PROPDMG (approximate property damages); PROPDMGEXP (unit value of property damage); CROPDMG (approximate crop damages); CROPDMGEXP (units value of crop damage).
I will create a new dataframe with only these variables included.
# select variables relevant for the assignment
data <- noaa.data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# check missing values in data
colSums(is.na(data))
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0 0 0 0 0
We find that there are no missing values in the data, but the events variable (EVTYPE), and the variables for dollar units of damages (PROPDMGEXP and CROPDMGEXP) are very untidy, with inconsistent capitalisation and spacing that makes several values appear as different when they are infact the same. I will now fix such values. For the EVTYPE variable, I will identify the most recurring values since the variable has 985 values. I will then collapse the values that relate to these. I will categorise the rest as “OTHER EVENT” For the other two, I will collapse all Ks, Ms and Bs into 1000, 1000,000 and 1,000,000,000 respectively irrespective of case. all the other values will be collapsed into 1.
# identify the most recurring values in EVTYPE
sort(table(data$EVTYPE), decreasing = TRUE)[1:20]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
# Tidy up values and collapse values that mean the same
data$EVTYPE <- gsub("^ ", "", data$EVTYPE)
data$EVTYPE <- gsub("^ ", "", data$EVTYPE)
data$EVTYPE[grep("^ABNORMAL", data$EVTYPE, ignore.case = TRUE)] <- "ABNORMAL WEATHER"
data$EVTYPE[grep("^ASTRONOMICAL", data$EVTYPE, ignore.case = TRUE)] <- "UNUSUAL TIDE"
data$EVTYPE[grep("^AVALANC", data$EVTYPE, ignore.case = TRUE)] <- "AVALANCHE"
data$EVTYPE[grep("^BEACH", data$EVTYPE, ignore.case = TRUE)] <- "BEACH EVENT"
data$EVTYPE[grep("^BLIZZARD", data$EVTYPE, ignore.case = TRUE)] <- "BLIZZARD EVENT"
data$EVTYPE[grep("^BLOWING SNOW", data$EVTYPE, ignore.case = TRUE)] <- "BLOWING SNOW"
data$EVTYPE[grep("^COASTAL", data$EVTYPE, ignore.case = TRUE)] <- "COASTAL EVENT"
data$EVTYPE[grep("^COLD", data$EVTYPE, ignore.case = TRUE)] <- "COLD CONDITIONS"
data$EVTYPE[grep("^DRY", data$EVTYPE, ignore.case = TRUE)] <- "DRY WEATHER"
data$EVTYPE[grep("^EARLY", data$EVTYPE, ignore.case = TRUE)] <- "EARLY ONSET OF WEATHER CONDITION"
data$EVTYPE[grep("HEAT", data$EVTYPE, ignore.case = TRUE)] <- "HEAT"
data$EVTYPE[grep("WIND", data$EVTYPE, ignore.case = TRUE)] <- "WIND"
data$EVTYPE[grep("RAIN", data$EVTYPE, ignore.case = TRUE)] <- "RAIN"
data$EVTYPE[grep("FLOOD", data$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
data$EVTYPE[grep("FIRE", data$EVTYPE, ignore.case = TRUE)] <- "FIRE"
data$EVTYPE[grep("^WINTER", data$EVTYPE, ignore.case = TRUE)] <- "WINTER WEATHER"
data$EVTYPE[grep("WATER", data$EVTYPE, ignore.case = TRUE)] <- "WATERSPOUT"
data$EVTYPE[grep("HAIL", data$EVTYPE, ignore.case = TRUE)] <- "HAIL"
data$EVTYPE[grep("^TORNADO", data$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
data$EVTYPE[grep("^FUNNEL", data$EVTYPE, ignore.case = TRUE)] <- "FUNNEL CLOUD"
data$EVTYPE[grep("^URBAN", data$EVTYPE, ignore.case = TRUE)] <- "URBAN/SML STREAM FLD"
f <- c("FLASH FLOOD", "FLASH FLOOD/", "FLASH FLOOD/FLOOD", "FLASH FLOODING", "FLASH FLOODING/FLOOD", "FLASH FLOODS", "FLASH FLOOODING", "FLOOD FLASH", "FLOOD FLOOD/FLASH", "FLOOD/FLASH", "Flood/Flash Flood", "FLOOD/FLASH FLOOD", "FLOOD/FLASH FLOODING", "FLOOD/FLASH/FLOOD", "FLOOD/FLASHFLOOD", "FLASH FLOOD - HEAVY RAIN", "FLASH FLOOD FROM ICE JAMS", "FLASH FLOOD LANDSLIDES", "FLASH FLOOD WINDS", "FLASH FLOOD/ STREET", "FLASH FLOOD/HEAVY RAIN", "FLASH FLOOD/LANDSLIDE", "FLASH FLOODING/THUNDERSTORM WI")
for(i in 1:length(f)) {
data$EVTYPE <- gsub(f[i], "FLASH FLOOD", data$EVTYPE)
}
d <- c(" LIGHTNING", "LIGHTNING", "LIGHTNING WAUSEON", "LIGHTNING AND HEAVY RAIN", "LIGHTNING AND THUNDERSTORM WIN", "LIGHTNING FIRE", "LIGHTNING INJURY", "LIGHTNING THUNDERSTORM WINDS", "LIGHTNING THUNDERSTORM WINDSS", "LIGHTNING.", "LIGHTNING/HEAVY RAIN")
for(i in 1:length(d)) {
data$EVTYPE <- gsub(d[i], "LIGHTNING", data$EVTYPE)
}
Because of the very many values in this variable, I will only collapse the variable into the dominant values and group the rest as “OTHER EVENT”. At this point, let us take a look at the most frequent weather events across the United States.
# identify the most recurring values.
sort(table(data$EVTYPE), decreasing = TRUE)[1:7]
##
## WIND HAIL FLOOD TORNADO WINTER WEATHER
## 364342 289276 81841 60684 19596
## LIGHTNING HEAVY SNOW
## 15759 15708
# Create a character vector of recurring values
recurrent <- c("WIND", "HAIL", "FLOOD", "TORNADO", "WINTER WEATHER", "LIGHTNING", "HEAVY SNOW")
# Create an index of values that are not recurrent
index <- !(data$EVTYPE %in% recurrent)
# Assign non-recurrent events as "Other Values"
data$EVTYPE[index] <- "OTHER EVENT"
# Make EVTYPE variable into a factor
data$EVTYPE <- factor(data$EVTYPE)
Next, I will look at the PROPDMGEXP variable and fix the units.
# identify values in PROPDMGEXP
sort(table(data$PROPDMGEXP), decreasing = TRUE)
##
## K M 0 B 5 1 2 ? m H
## 465934 424665 11330 216 40 28 25 13 8 7 6
## + 7 3 4 6 - 8 h
## 5 5 4 4 4 1 1 1
# collapse values that mean the same
data$PROPDMGEXP[grep("K", data$PROPDMGEXP, ignore.case = TRUE)] <- 1000
data$PROPDMGEXP[grep("M", data$PROPDMGEXP, ignore.case = TRUE)] <- 1000000
data$PROPDMGEXP[grep("B", data$PROPDMGEXP, ignore.case = TRUE)] <- 1000000000
# Create a numeric vector of recurring values
units <- c(1000, 1000000, 1000000000)
# Create an index of values that are not recurrent
index <- !(data$PROPDMGEXP %in% units)
# Assign 1 as the value for non-recurrent units
data$PROPDMGEXP[index] <- 1
Finally, I will look at the CROPDMGEXP variable and fix the units.
# identify values in CROPDMGEXP
sort(table(data$CROPDMGEXP), decreasing = TRUE)
##
## K M k 0 B ? 2 m
## 618413 281832 1994 21 19 9 7 1 1
# collapse values that mean the same
data$CROPDMGEXP[grep("K", data$CROPDMGEXP, ignore.case = TRUE)] <- 1000
data$CROPDMGEXP[grep("M", data$CROPDMGEXP, ignore.case = TRUE)] <- 1000000
data$CROPDMGEXP[grep("B", data$CROPDMGEXP, ignore.case = TRUE)] <- 1000000000
# Create an index of values that are not recurrent
index <- !(data$CROPDMGEXP %in% units)
# Assign 1 as the value for non-recurrent units
data$CROPDMGEXP[index] <- 1
I will now estimate the value of material and crop damage for each event by multiplying the damage values by their corresponding dollar units.
data$Property.Damage <- data$PROPDMG*as.numeric(data$PROPDMGEXP)
data$Crop.Damage <- data$CROPDMG*as.numeric(data$CROPDMGEXP)
In this section, I will explore the data to find out which types of weather events are most harmful to public health and the economy.
# load dplyr library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# generate average number of injuries and fatalities for each category of weather event
health <- data %>%
group_by(EVTYPE) %>%
summarise(AverageInjuries = mean(INJURIES), AverageFatalities = mean(FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
# create a sub dataset for average fatalities only
health1 <- data.frame(Event = health$EVTYPE, Type = "Fatalities", Effect = health$AverageFatalities)
# create a sub dataset for average injuries only
health2 <- data.frame(Event = health$EVTYPE, Type = "Injuries", Effect = health$AverageInjuries)
# combine the injuries and fatalities sub data frames
HealthEffects <- rbind(health1, health2)
# load ggplot2 library
library(ggplot2)
# generate basic esthetics for mapping coordinates of the plot
HealthPlot <- ggplot(HealthEffects, aes(Event, Effect, fill = Type)) +
# Add bar graphs to the plot
geom_bar(stat = "identity") +
# Add labels to the plot
labs(title = "Health Effects of Weather Events in the United States",
subtitle = "1950 to 2011",
x = "Weather Event",
tag = "Figure 1",
y = "Average Effect per Weather Event")
print(HealthPlot)
We notice that Tornados and Lightening are by far the most responsible weather events for injuries and are also the identifiable events most responsible for fatalities, slightly lead by “Other Events.
# generate average value of property and crop dammage for each category of weather event
economy <- data %>%
group_by(EVTYPE) %>%
summarise(AveragePropertyDammage = mean(Property.Damage), AverageCropDamage = mean(Crop.Damage))
## `summarise()` ungrouping output (override with `.groups` argument)
# creat a sub dataframe for average property dammage only
econ1 <- data.frame(Event = economy$EVTYPE, Type = "Property Damage", Effect = economy$AveragePropertyDammage)
# create a sub data frame for average crop dammage only
econ2 <- data.frame(Event = economy$EVTYPE, Type = "Crop Damage", Effect = economy$AverageCropDamage)
# combine average property and crop dammage data frames into one
EconomicEffects <- rbind(econ1, econ2)
# generate basic esthetics for mapping coordinates of the plot
EconomicPlot <- ggplot(EconomicEffects, aes(Event, log(Effect), fill = Type)) +
# Add bar graphs to the plot
geom_bar(stat = "identity") +
# Add labels to the plot
labs(title = "Economic Effects of Weather Events in the United States",
subtitle = "1950 to 2011",
x = "Weather Event",
tag = "Figure 2",
y = "Natural Log of Average Loss in Dollars per Weather Event")
print(EconomicPlot)
Hear, we see that Floods inflict the most damage to both property and crops, while followed by Tornados for property and Hail for crops.