Synopsis

Storms and other severe weather events carry heavy economic consequences and also endanger the population directly, causing injuries and, in the most severe cases, fatalities. Understanding the risks linked to specific types of weather events is a key step toward preventing future catastrophes. In this assignment, we explore the Storm Database collected by the U.S. National Oceanic and Atmospheric Administration (NOAA). This database tracks major weather events that occurred in the United States between 1950 and 2011, including their dates, their geographical locations, and estimates of any fatalities, injuries and property damage. After preprocessing the data, this report examines which event types are most harmful to population health and which have the greatest economic consequences.

Data Processing

We start by loading the R packages that will be used in this assignment.

# Load the needed packages for the assignment
library(dplyr)
library(ggplot2)
library(tidyr)

Then, we download the Storm data from the link given in the assignment and save the bz2 file as repdata_data_StormData.csv.bz2. If a file with this name is already present in the working directory, the download is skipped. The full raw table is then read into the raw_table variable.

# Download the data from the url unless the file already exists in the working directory
data_folder_name <- "repdata_data_StormData.csv.bz2"

if (!file.exists(data_folder_name)) {
      data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
      download.file(data_url, data_folder_name, method="curl")
}

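# Read the compressed CSV into a data frame (read.csv can read the bz2 file directly
# and, unlike read.table, does not treat "#" or apostrophes in free text specially)
raw_table <- read.csv(data_folder_name)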

We take a first look at the data. The table has 902297 rows and 37 columns. The names of the 37 columns are shown below, along with the first 3 rows of the table.

# Check the dimensions of the data table and display its first 3 rows
cat("The dimension of the raw table is:", dim(raw_table))
## The dimension of the raw table is: 902297 37
head(raw_table, 3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3

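Not all features are needed for the required tasks. Therefore, only the relevant columns are kept for the rest of this assignment. The columns are listed in the selected_columns vector: EVTYPE (the type of event), FATALITIES and INJURIES (the number of fatalities and injuries), PROPDMG and CROPDMG (the estimated property and crop damage), and PROPDMGEXP and CROPDMGEXP (the order of magnitude of those two estimates).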

The chosen columns are saved in selected_df.

# Select the columns that are relevant for the analysis and create a new dataframe
# containing only those columns
selected_columns <- c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
selected_df <- raw_table %>%
      select(all_of(selected_columns))
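Before converting the damage values, it can help to check which exponent codes actually occur. The following is a minimal sketch on a small made-up data frame; toy_df and its values are hypothetical stand-ins for selected_df and are not taken from the NOAA file.

```r
library(dplyr)

# Toy stand-in for selected_df; the values are invented for illustration
toy_df <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 1, 3),
  PROPDMGEXP = c("K", "k", "M", "", "B")
)

# Count how often each exponent code occurs, ignoring case
exp_counts <- toy_df %>%
  mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>%
  count(PROPDMGEXP, sort = TRUE)

print(exp_counts)
```

Running the same count on selected_df would show whether any codes other than the expected letters are present.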

The PROPDMGEXP and CROPDMGEXP columns contain the order of magnitude of the values stored in PROPDMG and CROPDMG. More specifically, they can contain one of the following letters: h for hundred, k for thousand, m for million, or b for billion. To prepare the data for plotting, the values stored in PROPDMG and CROPDMG are therefore multiplied by a factor of \(10^2\), \(10^3\), \(10^6\) or \(10^9\), depending on their order of magnitude. If none of these letters is present, the value is kept as it is. To handle both uppercase and lowercase letters, all letters are converted to lowercase before the factor is applied.

The resulting damage values are saved in the columns PROPDMG2 and CROPDMG2.

# Multiply the costs with the right order of magnitude factor and save it in
# a new column
selected_df <- selected_df %>%
      mutate(CROPDMGEXP = tolower(CROPDMGEXP)) %>%
      mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>%
      mutate(PROPDMG2 = case_when(
            grepl("h", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e2,
            grepl("k", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e3,
            grepl("m", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e6,
            grepl("b", PROPDMGEXP, fixed = TRUE) ~ PROPDMG * 1e9,
            TRUE ~ PROPDMG  
            )) %>%
      mutate(CROPDMG2 = case_when(
            grepl("h", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e2,
            grepl("k", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e3,
            grepl("m", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e6,
            grepl("b", CROPDMGEXP, fixed = TRUE) ~ CROPDMG * 1e9,
            TRUE ~ CROPDMG  ))
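The same conversion can also be written with a named lookup vector, which avoids repeating the case_when block for each column. This is only a sketch of an alternative, not the code used in this report: scale_damage and exp_factor are hypothetical names, and toy_df contains made-up values. Unlike grepl, the lookup matches the whole code exactly, which is equivalent as long as each code is a single character.

```r
library(dplyr)

# Multipliers for the recognised exponent letters (lowercase)
exp_factor <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper: scale a value by its exponent letter and
# leave it unchanged when the code is empty or not recognised
scale_damage <- function(value, exp_code) {
  multiplier <- unname(exp_factor[tolower(exp_code)])
  value * ifelse(is.na(multiplier), 1, multiplier)
}

# Toy data for illustration only
toy_df <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 3),
  PROPDMGEXP = c("K", "m", "", "B")
)

toy_df <- toy_df %>%
  mutate(PROPDMG2 = scale_damage(PROPDMG, PROPDMGEXP))

print(toy_df)
```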

After this first preprocessing step, the total number of fatalities and injuries by event type is calculated below. A new table, table_health, is created for this purpose. It contains 4 variables: EVTYPE, total_fatalities, total_injuries, and total_fatalities_injuries (the sum of the previous two).

Rows that have an NA value in either the INJURIES column or the FATALITIES column are ignored.

# Create a new dataframe containing the sum of total fatalities, injuries, and 
# both by grouping the dataframe by event. 
table_health <- selected_df %>% 
      filter(!is.na(FATALITIES) & !is.na(INJURIES)) %>%
      group_by(EVTYPE) %>% 
      summarise(total_fatalities = sum(FATALITIES, na.rm = TRUE),
                total_injuries = sum(INJURIES, na.rm = TRUE),
                total_fatalities_injuries = total_fatalities + total_injuries) %>%
      arrange(desc(total_fatalities_injuries))

A similar table, called table_economic, is prepared for the economic cost of each event type. It contains the following 4 variables: EVTYPE, total_propdmg, total_cropdmg, and total_dmg (the sum of the previous two).

# Create a new dataframe containing the sum of total property, crop, and 
# both by grouping the dataframe by event. 
table_economic <- selected_df %>%
      filter(!is.na(PROPDMG) & !is.na(CROPDMG)) %>%
      group_by(EVTYPE) %>%
      summarise(total_propdmg = sum(PROPDMG2, na.rm = TRUE),
                total_cropdmg = sum(CROPDMG2, na.rm = TRUE)) %>%
      mutate(total_dmg = total_propdmg + total_cropdmg)

Finally, because of the high number of event types, we keep only the top 12 events from each of the two tables:

# For plotting purpose, we only keep the 12 events with the highest number of
# fatalities/injuries, and the 12 events with the highest economic cost
top_12_health <- head(table_health[order(table_health$total_fatalities_injuries, decreasing = TRUE), ], 12)
top_12_economic <- head(table_economic[order(table_economic$total_dmg, decreasing = TRUE), ], 12)
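For reference, the same top-N selection can be expressed with dplyr's slice_max (available in dplyr 1.0.0 and later). The sketch below uses a made-up table, toy_health, and keeps the top 2 rows instead of 12 so the result is easy to check by eye.

```r
library(dplyr)

# Toy stand-in for table_health; the values are invented for illustration
toy_health <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  total_fatalities_injuries = c(10, 50, 30, 5)
)

# Keep the 2 rows with the highest combined count, in descending order
top_2 <- toy_health %>%
  slice_max(total_fatalities_injuries, n = 2, with_ties = FALSE)

print(top_2)
```

Setting with_ties = FALSE guarantees exactly n rows, which matches the behaviour of head() on a sorted table.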

Those final tables will be used in the Results section to answer the questions of the assignment.

Results

The data is now preprocessed, and we can visualize and analyze the Storm data in more detail. First, we want to know which types of events are most harmful with respect to population health. To answer this question, a panel plot is displayed below. It contains a first barplot of the total count of fatalities (in red) and a second barplot of the total count of injuries (in blue) for the 12 events with the highest number of fatalities and injuries combined.

From the two barplots, one can observe that tornadoes have the highest impact on population health by a large margin. Tornadoes alone are responsible for 91346 injuries and 5633 fatalities (96979 in total). The next most harmful event is Excessive Heat, with 6525 injuries and 1903 fatalities (8428 in total), followed by TSTM Winds with 6957 injuries and 504 fatalities (7461 in total). Other notable events are Floods, Lightning, Heat, Flash Floods, Ice Storms, Thunderstorm Winds, Winter Storms, High Winds and Hail, in this order.

The event types are displayed on the x-axis in descending order of the combined number of injuries and fatalities, but additional observations can be made by considering fatalities or injuries separately. For example, Flash Floods, Heat and Lightning have fewer fatalities and injuries combined (red and blue) than Floods and TSTM Winds. When considering the red bars only, however, those events are deadlier, since their fatality counts are higher.

# For plotting purposes, reshape the table so that total fatalities and total
# injuries are stacked in one column, and add a column "type_of_loss" which
# indicates what each count refers to (fatalities or injuries)
plot_data <- top_12_health %>%
      pivot_longer(cols = c(total_fatalities, total_injuries),
                   names_to = "type_of_loss",
                   values_to = "count")

# Reorder the event types in descending order
plot_data$EVTYPE <- reorder(plot_data$EVTYPE, -plot_data$count, FUN = sum)

# Create a panel plot which contains 2 barplots, one for the fatalities (in red)
# and one for the injuries (in blue)
p <- ggplot(plot_data, aes(x = EVTYPE, y = count, fill = type_of_loss)) +
      geom_bar(position = "dodge", stat = "identity") +
      facet_wrap(~ type_of_loss, scales = "free_y") +
      theme_light() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(x = "Event Type", y = "Count",
           title = "Fatalities and Injuries by Event Type (Top 12)")
p + scale_y_continuous(sec.axis = sec_axis(~ ., name = "Count"))

Next, we want to know which events have the greatest economic consequences. To answer this question, a stacked barplot is shown below, with the cost due to property damage in light blue and the cost due to crop damage in light salmon. From the plot, one observes that the most costly type of event is Floods, with a combined cost of approx. 150000 million dollars. The next most costly events are Hurricanes/Typhoons with approx. 72000 million dollars, followed by Tornadoes (approx. 57000 million), Storm Surges (approx. 43000 million), Hail and Flash Floods.

In most cases, property damage costs exceed crop damage costs by a large margin. The only exception is Drought, where the cost is almost entirely due to crop damage.

# Reorder the event types in descending order
top_12_economic$EVTYPE <- reorder(top_12_economic$EVTYPE, -top_12_economic$total_dmg)

# Create the barplot: the full bar shows the total cost and the crop damage bar
# is drawn over it, so the visible light blue part is the property damage and
# the light salmon part is the crop damage
ggplot(top_12_economic, aes(x = EVTYPE, y = total_dmg/1000000)) +
  geom_bar(aes(fill = "Total Property Cost"), stat = "identity") +
  geom_bar(aes(y = total_cropdmg/1000000, fill = "Total Crop Cost"), stat = "identity") +
  scale_fill_manual(values = c("Total Property Cost" = "lightblue", "Total Crop Cost" = "lightsalmon")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Event Type", y = "Total Expenses due to damage \n (in Millions)",
       title = "Total Economic Cost by Event Type (Top 12)")
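The plot above obtains the stacked appearance by overlaying two bars. An alternative, sketched here under assumptions, is to reshape the table to long format and let ggplot2 stack the two components itself. The table toy_economic and its values are made up for illustration, and stack_data and p_stacked are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for top_12_economic; the values are invented for illustration
toy_economic <- data.frame(
  EVTYPE        = c("FLOOD", "DROUGHT", "HAIL"),
  total_propdmg = c(9e9, 1e9, 4e9),
  total_cropdmg = c(1e9, 5e9, 1e9)
)

# Order the event types by descending total damage
total <- toy_economic$total_propdmg + toy_economic$total_cropdmg
toy_economic$EVTYPE <- factor(toy_economic$EVTYPE,
                              levels = toy_economic$EVTYPE[order(-total)])

# One row per event type and damage type
stack_data <- toy_economic %>%
  pivot_longer(cols = c(total_propdmg, total_cropdmg),
               names_to = "type_of_damage",
               values_to = "cost")

# geom_col stacks the two damage types within each event type
p_stacked <- ggplot(stack_data,
                    aes(x = EVTYPE, y = cost / 1e6, fill = type_of_damage)) +
  geom_col(position = "stack") +
  scale_fill_manual(name = NULL,
                    breaks = c("total_propdmg", "total_cropdmg"),
                    values = c("lightblue", "lightsalmon"),
                    labels = c("Total Property Cost", "Total Crop Cost")) +
  theme_minimal() +
  labs(x = "Event Type", y = "Total cost (in millions)")

print(p_stacked)
```

With this layout each bar's height is the sum of its two segments, so the legend colors map directly onto the data rather than onto overlapping layers.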

From this analysis, we have identified the storm events in the United States that are the most harmful to the population's health, as well as the events that are the most economically costly. Specifically, tornadoes have the greatest impact on health, while floods are responsible for the highest economic damage. Interestingly, the two rankings do not always overlap: events responsible for high property or crop damage do not always have a large health impact, and vice versa. Overall, tornadoes stand out the most, since they have the highest impact on the population's health and are also the 3rd most costly type of event.