Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal is to explore the NOAA Storm Database and answer the following questions about severe weather events:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
The data set is downloaded only if it does not already exist in the working directory. It is then decompressed and loaded into memory using the R function read.csv, which gives the raw data:
# downloading the data from the NOAA web
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_name <- "repdata_data_StormData.csv.bz2"
if (!file.exists(file_name)) {
# save the archive under file_name so that read.csv can find it afterwards
download.file(file_url, destfile = file_name, mode = 'wb')
date_download <- date()
}
# decompressing and loading into memory
raw_data <- read.csv(file_name)
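No explicit decompression call is needed here because read.csv opens a file path through R's file connections, which detect bzip2 compression when reading. The following minimal sketch illustrates this with a throwaway temporary file; the toy data frame and its values are made up and are not part of the NOAA data.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write it to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv is given only the path; the compression is detected on read
toy_back <- read.csv(tmp)
```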
Some documentation of the database is available, such as information about how the variables are constructed and defined in the National Weather Service Storm Data Documentation. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The variables that will be used for this analysis include:
EVTYPE: Event Type
FATALITIES: number of fatalities recorded for each event
INJURIES: number of injuries recorded for each event
PROPDMG: property damage amount
PROPDMGEXP: unit (magnitude) of the property damage amount
The impact of severe weather events on population health is represented by fatalities and injuries, so both were analyzed.
Fatalities
The number of fatalities per event type was computed, omitting missing values.
fatalities_per_event <- aggregate(FATALITIES ~ EVTYPE, data = raw_data, FUN = sum, na.rm = TRUE)
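As a minimal sketch of what this aggregation does, the example below sums a made-up toy table by event type. With the formula interface, aggregate drops rows containing missing values by default (na.action = na.omit), so an event type whose only record is NA does not appear in the result at all.

```r
# Toy records (made up): one row per event, HAIL has a missing count
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(2, 1, 5, NA)
)

# Sum fatalities within each event type
toy_totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# TORNADO has two records (2 and 5), FLOOD has one (1)
tornado_total <- toy_totals$FATALITIES[toy_totals$EVTYPE == "TORNADO"]
```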
The most harmful event types were determined by computing the 95th percentile of these fatality totals and keeping the event types above it, that is, the top 5%. For that purpose, the fatalities by event type were sorted in ascending order and only totals greater than zero were considered.
order_fatalities_asc <- fatalities_per_event[order(fatalities_per_event$FATALITIES), ]
# remove events that did not cause fatalities
order_fatalities_asc <- order_fatalities_asc[which(order_fatalities_asc$FATALITIES > 0), ]
# compute the 95th percentile of the fatality totals
fatalities_95_percentil <- quantile(order_fatalities_asc$FATALITIES, c(.95))
# keep the top 5% of event types
fatalities_most_harmful <- order_fatalities_asc[which(order_fatalities_asc$FATALITIES > fatalities_95_percentil), ]
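The reason for discarding zero totals before computing the percentile can be seen in a small worked example. The sketch below uses made-up totals (80 event types with no fatalities and 20 with 1 to 20) and compares how many event types pass the threshold with and without the zeros.

```r
# Toy totals (made up): 80 event types with no fatalities, 20 with 1 to 20
toy_totals <- data.frame(
  EVTYPE = paste0("EVENT_", 1:100),
  FATALITIES = c(rep(0, 80), 1:20)
)

# Threshold computed over every event type, zeros included
threshold_all <- quantile(toy_totals$FATALITIES, 0.95)
n_selected_all <- sum(toy_totals$FATALITIES > threshold_all)

# Threshold computed over the event types with at least one fatality
toy_positive <- toy_totals[toy_totals$FATALITIES > 0, ]
threshold_positive <- quantile(toy_positive$FATALITIES, 0.95)
n_selected_positive <- sum(toy_positive$FATALITIES > threshold_positive)
```

In this toy case the zeros pull the threshold down, so five event types pass when they are included, compared with one when they are removed first.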
Injuries
The number of injuries per event type was computed, omitting missing values.
injuries_per_event <- aggregate(INJURIES ~ EVTYPE, data = raw_data, FUN = sum, na.rm = TRUE)
The most harmful event types were determined by computing the 95th percentile of these injury totals and keeping the event types above it, that is, the top 5%. For that purpose, the injuries by event type were sorted in ascending order and only totals greater than zero were considered.
order_injuries_asc <- injuries_per_event[order(injuries_per_event$INJURIES), ]
# remove events that did not cause injuries
order_injuries_asc <- order_injuries_asc[which(order_injuries_asc$INJURIES > 0), ]
# compute the 95th percentile of the injury totals
injuries_95_percentil <- quantile(order_injuries_asc$INJURIES, c(.95))
# keep the top 5% of event types
injuries_most_harmful <- order_injuries_asc[which(order_injuries_asc$INJURIES > injuries_95_percentil), ]
Many severe events result in property damage, which is represented in the data set by the PROPDMG variable. Its units are given by the variable PROPDMGEXP: H (100), K (1,000), M (1,000,000) or B (1,000,000,000). A preprocessing step converted all property damage amounts into dollars and stored them in a new variable, PROPDMG_dollar.
damage_data <- raw_data
# add a new column with the damage in dollars
damage_data$PROPDMG_dollar <- ifelse (damage_data$PROPDMGEXP == "B", damage_data$PROPDMG*10^9,
ifelse (damage_data$PROPDMGEXP %in% c("M", "m"), damage_data$PROPDMG*10^6,
ifelse (damage_data$PROPDMGEXP %in% c("K", "k"), damage_data$PROPDMG*10^3,
ifelse (damage_data$PROPDMGEXP %in% c("H", "h"), damage_data$PROPDMG*10^2,
damage_data$PROPDMG))))
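The same unit conversion can also be written as a lookup table, which some readers may find easier to follow than nested ifelse calls. The sketch below is an alternative under stated assumptions, not the code used in this analysis: the names toy, multiplier and mult are hypothetical, the toy values are made up, and unlike the code above it treats a lower-case "b" as billions because the codes are upper-cased first.

```r
# Toy records (made up) covering each unit code plus an empty one
toy <- data.frame(
  PROPDMG = c(2.5, 10, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", "H", "")
)

# Multiplier for each recognised unit code
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Look up each row's code; unrecognised codes give NA, which falls back to 1
mult <- multiplier[toupper(as.character(toy$PROPDMGEXP))]
mult[is.na(mult)] <- 1

# Damage in dollars
toy$PROPDMG_dollar <- toy$PROPDMG * unname(mult)
```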
Then, the property damage values in dollars were aggregated by event type omitting the missing values.
damage_per_event <- aggregate(PROPDMG_dollar ~ EVTYPE, data = damage_data, FUN = sum, na.rm = TRUE)
The same criterion was used to determine the most harmful events: the 95th percentile of the property damage totals was computed and the event types above it (the top 5%) were selected. For that purpose, the property damage values by event type were sorted in ascending order and only totals greater than zero were considered.
# order the data by damages in ascending order
order_damage_asc <- damage_per_event[order(damage_per_event$PROPDMG_dollar), ]
# remove events that did not cause damage
order_damage_asc <- order_damage_asc[which(order_damage_asc$PROPDMG_dollar > 0), ]
# compute the 95th percentile of the property damage totals
damages_95_percentil <- quantile(order_damage_asc$PROPDMG_dollar, c(.95))
# keep the top 5% of event types, which cause the worst damage
damages_most_harmful <- order_damage_asc[which(order_damage_asc$PROPDMG_dollar > damages_95_percentil), ]
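Since the same selection is applied to fatalities, injuries and property damage, it could be wrapped in a single helper. The function below is a hypothetical sketch (top_events_above_quantile is not part of the original analysis) that reproduces the steps described above: drop zero totals, sort in ascending order, compute the percentile and keep the rows above it. It is shown on made-up toy totals.

```r
# Hypothetical helper: keep the rows whose total in `column` lies above
# the `prob` quantile of the strictly positive totals
top_events_above_quantile <- function(totals, column, prob = 0.95) {
  positive <- totals[totals[[column]] > 0, ]
  positive <- positive[order(positive[[column]]), ]
  threshold <- quantile(positive[[column]], prob)
  positive[positive[[column]] > threshold, ]
}

# Toy totals (made up): 5 event types with no damage, 20 with 1,000 to 20,000
toy_damage <- data.frame(
  EVTYPE = paste0("EVENT_", 1:25),
  PROPDMG_dollar = c(rep(0, 5), (1:20) * 1000)
)

toy_top <- top_events_above_quantile(toy_damage, "PROPDMG_dollar")
```

Under this sketch, the three selections would become calls such as top_events_above_quantile(fatalities_per_event, "FATALITIES").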
A two-row panel depicts the most harmful event types with respect to fatalities and injuries. The ggplot2 library is used to plot the information and the cowplot library is used to arrange the two plots in a two-row panel. Note that both the event types and their number differ between fatalities and injuries, because they were selected using the 95th percentile in each case.
library(cowplot)
library(ggplot2)
# plot fatalities
fatalities_plot <- ggplot(fatalities_most_harmful, aes(x = FATALITIES, y = reorder(EVTYPE, FATALITIES))) +
geom_bar(stat = "identity", fill = "darkorange") +
xlab('Fatalities') +
ylab('Events')
# plot injuries
injuries_plot <- ggplot(injuries_most_harmful, aes(x = INJURIES, y = reorder(EVTYPE, INJURIES))) +
geom_bar(stat = "identity", fill = "darkturquoise") +
xlab('Injuries') +
ylab('Events')
plot_grid(fatalities_plot, injuries_plot, align = "v", nrow = 2)
These graphs show that tornadoes are the most harmful weather events with respect to population health, in terms of both fatalities and injuries.
The most harmful event types with respect to property damage are depicted in this plot, created with the ggplot2 library. The damage amounts are large enough to be expressed in billions of dollars.
# plot damages
library(ggplot2)
ggplot(damages_most_harmful, aes(x = PROPDMG_dollar/1e9, y = reorder(EVTYPE, PROPDMG_dollar/1e9))) +
geom_bar(stat = "identity", fill = "blue") +
xlab('Property damage [billions of dollars]') +
ylab('Events') +
ggtitle('Property damage caused by most harmful events')
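Because the axis is labelled in billions of dollars, the divisor is worth a quick check. In R's scientific notation, 1e9 is 10^9 (one billion), whereas 10e9 is 10 × 10^9 (ten billion). The short sketch below confirms this with a made-up damage total.

```r
# Scientific notation: the number before "e" is multiplied by 10^exponent
stopifnot(1e9 == 10^9, 10e9 == 10^10)

# A made-up total of 2.5 billion dollars, expressed in billions
toy_damage_dollars <- 2.5e9
toy_damage_billions <- toy_damage_dollars / 1e9
```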
This graph shows that floods have the greatest economic consequences in terms of property damage.