Storms and other severe weather events carry heavy economic consequences and also put the population at risk, causing injuries and, in the most severe cases, fatalities. Understanding the risks linked to specific weather events is a key step toward preventing future catastrophes. This assignment explores the Storm Database collected by the U.S. National Oceanic and Atmospheric Administration (NOAA). The database tracks major weather events that occurred in the United States between 1950 and 2011, including their dates, their geographical locations, and estimates of any fatalities, injuries and property damage. After preprocessing, the data is used in this report to address two questions: which types of events are most harmful to population health, and which cause the greatest economic damage.
We start by loading the R packages that will be used in this assignment.
# Load the needed packages for the assignment
library(dplyr)
library(ggplot2)
library(tidyr)
Then, we load the Storm data via the link given in the assignment,
and save the bz2 file under
repdata_data_StormData.csv.bz2. If a pre-existing file with
this name is already present in the directory, this step is skipped. The
full raw table is then read into the raw_table
variable.
# Download the data from the URL if the file is not already present in the working directory
data_folder_name <- "repdata_data_StormData.csv.bz2"
if (!file.exists(data_folder_name)) {
  data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(data_url, data_folder_name, method="curl")
}
# Read the data into a table
raw_table <- read.table("repdata_data_StormData.csv.bz2", sep=",", header=TRUE)
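The file is read directly in its compressed form because R's file readers open bz2-compressed files transparently. The sketch below illustrates this on a temporary toy file; the data frame and the file are made up for illustration and are not the NOAA data.

```r
# Toy example: write a tiny bz2-compressed CSV, then read it back directly.
# The data frame and the temporary file are illustrative, not the NOAA data.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))
toy_path <- tempfile(fileext = ".csv.bz2")

# Open a compressed write connection and write the CSV through it
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# No manual decompression: the compression is detected when the file is opened
toy_back <- read.csv(toy_path, stringsAsFactors = FALSE)
print(dim(toy_back))
```

The same mechanism is what lets the read step above work on the downloaded .bz2 file without unpacking it first.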
We take a first look at the data. The table has 902297 rows and 37 columns. Its dimensions and its first 3 rows, which also show the names of the 37 columns, are printed below.
# Check the dimensions of the data table and print its first 3 rows
cat("The dimension of the raw table is:", dim(raw_table))
## The dimension of the raw table is: 902297 37
head(raw_table, 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
Not all features are needed for the tasks at hand. Therefore, only
the relevant columns are kept for the rest of this assignment. The
columns are listed in the selected_columns vector, and they
cover the following features:
EVTYPE, which gives the type of
event.
FATALITIES and INJURIES, which
respectively give the number of fatalities and injuries for a
specific event. These columns are of interest when considering which
events are the most harmful with respect to population
health.
PROPDMG and PROPDMGEXP, which together indicate
the cost of property damage in dollars. The value is
given in PROPDMG, while its order of magnitude is given in
PROPDMGEXP.
CROPDMG and CROPDMGEXP, which are
the equivalent columns for crop damage.
The chosen columns are saved in selected_df.
# Select the columns that are relevant for the analysis and create a new dataframe
# containing only those columns
selected_columns <- c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
selected_df <- raw_table %>%
select(all_of(selected_columns))
As described above, the PROPDMGEXP and
CROPDMGEXP columns contain the order of magnitude of the
values saved in PROPDMG and CROPDMG. More
specifically, they can contain one of the following letters:
h for hundred, k for thousand, m
for million, or b for billion. To prepare the data for
plotting, the values saved in PROPDMG and
CROPDMG are therefore multiplied by a factor of \(10^2\), \(10^3\), \(10^6\) or \(10^9\), depending on the letter.
If the exponent column is empty or contains none of these letters,
the value is kept as it is. Because the letters appear in both
uppercase and lowercase, they are all converted to lowercase
before the factor is applied.
The resulting dollar amounts are saved in the columns PROPDMG2
and CROPDMG2.
# Multiply the costs with the right order of magnitude factor and save it in
# a new column
selected_df <- selected_df %>%
  mutate(CROPDMGEXP = tolower(CROPDMGEXP)) %>%
  mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>%
  mutate(PROPDMG2 = case_when(
    grepl("h", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e2,
    grepl("k", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e3,
    grepl("m", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e6,
    grepl("b", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e9,
    TRUE ~ PROPDMG
  )) %>%
  mutate(CROPDMG2 = case_when(
    grepl("h", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e2,
    grepl("k", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e3,
    grepl("m", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e6,
    grepl("b", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e9,
    TRUE ~ CROPDMG
  ))
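For comparison, the same exponent conversion can be sketched with a named lookup vector in base R. This is an alternative illustration, not the code used in this report: the names exp_factor, dmg, dmg_exp and mult are hypothetical, the values are toy data, and the lookup requires an exact one-letter match, whereas the grepl() calls above match the letter anywhere in the string.

```r
# Hypothetical lookup table: exponent letter -> multiplier
exp_factor <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Toy values, not taken from the NOAA data
dmg     <- c(25, 2.5, 1, 3, 10)
dmg_exp <- c("K", "m", "B", "", "?")

# Look up each lowercased code; codes without a match give NA
mult <- unname(exp_factor[tolower(dmg_exp)])

# Unmatched or empty codes leave the value unchanged (factor 1)
mult[is.na(mult)] <- 1

dmg_dollars <- dmg * mult
print(dmg_dollars)
```

A lookup table keeps the mapping in one place, which may be convenient if further exponent codes need to be handled later.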
After this first preprocessing step, the total numbers of fatalities and
injuries per event type are calculated below. A new table,
table_health, is created for this purpose, containing 4
variables:
EVTYPE, the type of event.
total_fatalities, the total number of
fatalities for this particular type of event.
total_injuries, the total number of
injuries for this particular type of event.
total_fatalities_injuries, the sum of the total
number of injuries and the total number of fatalities for this particular
type of event.
Rows which have an NA value in either the
INJURIES column or the FATALITIES column are
ignored.
# Create a new dataframe containing, for each event type, the total fatalities,
# the total injuries, and the sum of both
table_health <- selected_df %>%
  filter(!is.na(FATALITIES) & !is.na(INJURIES)) %>%
  group_by(EVTYPE) %>%
  summarise(total_fatalities = sum(FATALITIES, na.rm = TRUE),
            total_injuries = sum(INJURIES, na.rm = TRUE),
            total_fatalities_injuries = total_fatalities + total_injuries) %>%
  arrange(desc(total_fatalities_injuries))
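As a cross-check of the grouping logic, the same per-event totals can be sketched in base R on a toy table. The data frame toy and the column combined are hypothetical and serve only to show the steps: sum per event type, add the two totals, sort in descending order.

```r
# Toy data, not taken from the NOAA database
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "FLOOD"),
  FATALITIES = c(2, 0, 1, 4),
  INJURIES   = c(10, 1, 5, 0)
)

# Sum both columns per event type; rows containing NA are dropped by default
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined count, then sort from most to least harmful
totals$combined <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(totals$combined, decreasing = TRUE), ]
print(totals)
```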
A similar table, called table_economic, is prepared for
the economic cost of an event. It contains the 4 following
variables:
EVTYPE, the type of event.
total_propdmg, the total cost of property damage due
to this particular event.
total_cropdmg, the total cost of crop damage due to
this particular event.
total_dmg, the sum of the cost of property damage
and crop damage for this particular event.
# Create a new dataframe containing, for each event type, the total property
# damage cost, the total crop damage cost, and the sum of both
table_economic <- selected_df %>%
  filter(!is.na(PROPDMG) & !is.na(CROPDMG)) %>%
  group_by(EVTYPE) %>%
  summarise(total_propdmg = sum(PROPDMG2, na.rm = TRUE),
            total_cropdmg = sum(CROPDMG2, na.rm = TRUE)) %>%
  mutate(total_dmg = total_propdmg + total_cropdmg)
Finally, because of the high number of event types, we keep only the top 12 event types from each of the tables prepared above:
top_12_health, the 12 event types with the highest number
of fatalities and injuries combined.
top_12_economic, the 12 event types with the highest
economic impact, i.e. the highest cost when combining property
and crop damage.
# For plotting purpose, we only keep the 12 events with the highest number of
# fatalities/injuries, and the 12 events with the highest economic cost
top_12_health <- head(table_health[order(table_health$total_fatalities_injuries, decreasing = TRUE), ], 12)
top_12_economic <- head(table_economic[order(table_economic$total_dmg, decreasing = TRUE), ], 12)
These final tables will be used in the Results section to answer the questions of the assignment.
The data is now preprocessed, and we can visualize and analyze the Storm data in more detail. First, we want to know which types of events are most harmful with respect to population health. To answer this question, a panel plot is displayed below. It contains a first barplot of the total count of fatalities (in red) and a second barplot of the total count of injuries (in blue) for the 12 event types with the highest number of fatalities and injuries combined.
From the two barplots, one can observe that Tornados have the highest impact on population health by a large margin. Tornados alone are responsible for 91346 injuries and 5633 fatalities (96979 when combining both). The next most harmful event type is Excessive Heat, with 6525 injuries and 1903 fatalities (8428 combined), followed by TSTM Winds with 6957 injuries and 504 fatalities (7461 combined). Other notable events are Floods, Lightning, Heat, Flash Floods, Ice Storms, Thunderstorm Winds, Winter Storms, High Winds and Hail, in this order.
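The combined figures quoted above can be re-derived with a few lines of arithmetic. The sketch below only uses the counts stated in the text; the vector names and the event labels are chosen for illustration.

```r
# Counts as quoted in the text for the three most harmful event types
injuries   <- c("TORNADO" = 91346, "EXCESSIVE HEAT" = 6525, "TSTM WIND" = 6957)
fatalities <- c("TORNADO" = 5633,  "EXCESSIVE HEAT" = 1903, "TSTM WIND" = 504)

# Combined fatalities and injuries per event type
combined <- injuries + fatalities
print(combined)

# How many times larger the tornado total is than the runner-up
print(round(combined[["TORNADO"]] / combined[["EXCESSIVE HEAT"]], 1))
```

The ratio shows that the combined tornado count is roughly 11.5 times that of Excessive Heat, which is what "by a large margin" refers to.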
The event types are displayed on the x-axis of the barplots in descending order of the combined number of injuries and fatalities, but additional observations can be made by considering fatalities or injuries separately. For example, Flash Floods, Heat and Lightning have fewer fatalities and injuries combined (red and blue) than Floods and TSTM Winds. When considering the red bars only, however, those events are deadlier, since their fatality counts are higher.
# For plotting purposes, reshape the table so that the total fatalities and
# the total injuries are in separate rows, with a column "type_of_loss" which
# indicates what the count refers to (fatalities or injuries)
plot_data <- top_12_health %>%
pivot_longer(cols = c(total_fatalities, total_injuries),
names_to = "type_of_loss",
values_to = "count")
# Reorder the event types in descending order
plot_data$EVTYPE <- reorder(plot_data$EVTYPE, -plot_data$count, FUN = sum)
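To illustrate what the reorder() call does, here is a minimal sketch on toy data (the data frame toy is hypothetical). reorder() sorts the factor levels by increasing value of the summary statistic, so negating the counts and summing them per event type yields levels in descending order of the combined count.

```r
# Toy data in long format: two counts per event type
toy <- data.frame(
  EVTYPE = c("HAIL", "HAIL", "TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  count  = c(1, 2, 50, 900, 30, 40)
)

# Levels are ordered by the sum of the negated counts, i.e. largest total first
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$count, FUN = sum)
print(levels(toy$EVTYPE))
```

ggplot2 places discrete x-axis values in factor-level order, which is why this step controls the order of the bars.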
# Create a panel plot which contains 2 barplots, one for the fatalities (in red)
# and one for the injuries (in blue)
# Note: only theme_light() is applied, since a complete theme added later
# overrides an earlier one
p <- ggplot(plot_data, aes(x = EVTYPE, y = count, fill = type_of_loss)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~ type_of_loss, scales = "free_y") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Event Type", y = "Count",
       title = "Fatalities and Injuries by Event Type (Top 12)")
p + scale_y_continuous(sec.axis = sec_axis(~ ., name = "Count"))
Next, we want to know which types of events have the greatest economic consequences. To answer this question, a stacked barplot is plotted below, with the cost due to property damage appearing in light blue, and the cost due to crop damage appearing in light salmon. From the graphic, one observes that the most costly type of event is Floods, with a combined cost of approx. 150000 million dollars (about 150 billion). The next most costly events are Hurricanes/Typhoons with approx. 72000 million dollars, followed by Tornados (approx. 57000 million), Storm Surges (approx. 43000 million), Hail and Flash Floods.
In most cases, property damage costs exceed crop damage costs by a large margin. The only exception is Drought, where damage costs are almost entirely due to crop damage.
# Reorder the event types in descending order
top_12_economic$EVTYPE <- reorder(top_12_economic$EVTYPE, -top_12_economic$total_dmg)
# Create a stacked barplot, with the costs due to property damage in light blue
# and the costs due to crop damage in light salmon. The crop bar is drawn over
# the bottom of the total bar, so the visible light blue part is the property cost.
ggplot(top_12_economic, aes(x = EVTYPE, y = total_dmg/1000000)) +
  geom_bar(aes(fill = "Total Property Cost"), stat = "identity") +
  geom_bar(aes(y = total_cropdmg/1000000, fill = "Total Crop Cost"), stat = "identity") +
  scale_fill_manual(values = c("Total Property Cost" = "lightblue", "Total Crop Cost" = "lightsalmon")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Event Type", y = "Total Expenses due to damage \n (in Millions)",
       title = "Total Economic Cost by Event Type (Top 12)")
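The plot above overlays the crop bar on the total bar instead of stacking two separate bars. The sketch below, on toy numbers, checks that the visible part of the total bar equals the property cost, and shows one possible way to build the same picture as a true stacked barplot from long-format data. The names toy, toy_long, visible_property and p_stacked are hypothetical, and the plotting part only runs if ggplot2 is installed.

```r
# Toy costs in millions, not taken from the NOAA data
toy <- data.frame(
  EVTYPE        = c("FLOOD", "DROUGHT"),
  total_propdmg = c(140, 1),
  total_cropdmg = c(10, 14)
)
toy$total_dmg <- toy$total_propdmg + toy$total_cropdmg

# In the overlay, the part of the total bar left visible above the crop bar
visible_property <- toy$total_dmg - toy$total_cropdmg

# Long format: one row per event type and kind of damage
toy_long <- rbind(
  data.frame(EVTYPE = toy$EVTYPE, kind = "Property", cost = toy$total_propdmg),
  data.frame(EVTYPE = toy$EVTYPE, kind = "Crop",     cost = toy$total_cropdmg)
)

# Optional: a true stacked barplot (geom_col stacks bars by default)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_stacked <- ggplot(toy_long, aes(x = EVTYPE, y = cost, fill = kind)) +
    geom_col() +
    labs(x = "Event Type", y = "Cost (in Millions)")
}
```

Both approaches produce bars of the same total height; the long-format version lets ggplot2 handle the stacking and the legend.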
From this analysis, we have identified the storm events in the United States that are the most harmful to the population's health, as well as the events that are the most economically costly due to damage. Specifically, Tornados have the greatest impact on health, while Floods are responsible for the highest economic damage. Interestingly, the two rankings do not always overlap: events responsible for high property or crop damage do not always have a great health impact, and vice versa. Overall, Tornados are the most notable type of event, since they have the highest impact on the population's health and are also the 3rd most costly event type.