Analysis of Storm Data

Synopsis

Severe weather events can have a significant impact on public health and the economy. Understanding which events pose the greatest risks can help policymakers and emergency responders better prepare for and respond to these events.

This analysis does that by exploring the public health and economic consequences of severe weather events in the United States, using the NOAA Storm Database.

This analysis focuses on two main questions:

  1. Which events are most harmful to population health?

  2. Which events have the greatest economic consequences?


The data is processed and analyzed using R, with visualizations created to illustrate the findings. The results show which types of severe weather events pose the greatest risks to public safety and economic stability, providing valuable insights for emergency preparedness and resource allocation.

Importing libraries

Before we start the analysis, we load the data.table package, which handles the data processing. The plots use base R graphics, so no plotting package is needed.

library(data.table)

Data Processing

Download Data

We first download the data from the NOAA Storm Database.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "repdata_data_StormData.csv.bz2")
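
Since the compressed file is fairly large, it may be worth skipping the download when a local copy already exists. The following is a minimal sketch of that idea; the helper name fetch_once is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: download a file only if it is not already on disk.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file byte-for-byte intact
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Demonstration with a file that already exists, so no network access happens
existing <- tempfile(fileext = ".csv.bz2")
file.create(existing)
result <- fetch_once("https://example.invalid/placeholder.csv.bz2", existing)
```

In the report itself, the call would take the url and destination file name used above.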

Load Data

Once the download finishes, we can load the data into our session.

data <- fread("repdata_data_StormData.csv.bz2")

Let us look at the data structure to understand the columns and values.

str(data)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

That’s clearly way too many variables. Luckily, we are only interested in a few of them. Let’s extract the relevant columns and clean the data.

Cleaning The Data

Extracting Required Columns

Since we are only interested in the economic and health impact of the events, we will extract the relevant columns for analysis. The relevant columns are:

  • EVTYPE: Event Type
  • FATALITIES: Number of Fatalities
  • INJURIES: Number of Injuries
  • PROPDMG & PROPDMGEXP: Property Damage
  • CROPDMG & CROPDMGEXP: Crop Damage

cleaned_data <- data[, .(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)]
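
As an alternative, the same subset could be taken at read time with the select argument of fread, which avoids holding all 37 columns in memory. The sketch below uses a small temporary CSV with made-up values as a stand-in for the storm file; the names tmp, wanted, and subset_data are illustrative only.

```r
library(data.table)

# Toy file standing in for the storm data (values are made up)
tmp <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    EVTYPE     = c("TORNADO", "FLOOD"),
    FATALITIES = c(1, 0),
    REMARKS    = c("not needed", "not needed")
  ),
  tmp
)

# Read only the columns of interest; every other column is skipped
wanted <- c("EVTYPE", "FATALITIES")
subset_data <- fread(tmp, select = wanted)
```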

Removing NAs

Since some of the data is missing, we will remove rows with missing values to ensure the accuracy of our analysis.

cleaned_data <- cleaned_data[complete.cases(cleaned_data), ]
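
Before dropping rows, it can help to check how many values are actually missing in each column. The sketch below uses a toy data frame with made-up values. Note that an empty string is not NA, so rows with a blank exponent code are kept by complete.cases.

```r
# Toy rows standing in for cleaned_data (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  INJURIES   = c(2, NA, 0),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))

# Keep only rows without any NA; blank strings do not count as missing
kept <- toy[complete.cases(toy), ]
```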

Adjusting Property Damage Values

Observe that the property and crop damage columns each have a companion multiplier column (PROPDMGEXP and CROPDMGEXP) that indicates the order of magnitude of the damage value: K for thousands, M for millions, and B for billions.
So, the true property damage is PROPDMG multiplied by the factor indicated by PROPDMGEXP, and the same is true for crop damage. Any other code is treated as a multiplier of 1.
We will adjust the damage values accordingly.

convert_exp <- function(exp) {
  if (exp %in% c("K", "k")) return(1e3)
  if (exp %in% c("M", "m")) return(1e6)
  if (exp %in% c("B", "b")) return(1e9)
  return(1)
}

cleaned_data[, PROPDMG := PROPDMG * sapply(PROPDMGEXP, convert_exp)]
cleaned_data[, CROPDMG := CROPDMG * sapply(CROPDMGEXP, convert_exp)]
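
Calling a function once per row with sapply works, but a named lookup vector can map all codes in one step. The following is a sketch of that alternative under the same rules as convert_exp (K, M, B in either case, everything else mapped to 1); the function name exp_multiplier is hypothetical.

```r
# Hypothetical vectorized counterpart of convert_exp
exp_multiplier <- function(exp) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # Indexing by name returns NA for codes that are not in the lookup
  mult <- unname(lookup[toupper(exp)])
  # Unknown or blank codes fall back to a multiplier of 1
  mult[is.na(mult)] <- 1
  mult
}

codes <- c("K", "k", "m", "B", "", "5", "?")
multipliers <- exp_multiplier(codes)

# Worked example: 25 with code "K" and 2.5 with code "M"
damage <- c(25, 2.5) * exp_multiplier(c("K", "M"))
```

Here damage evaluates to 25000 and 2500000, matching what the row-wise version would produce for the same inputs.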

Data Analysis

Finally, we are ready to analyze the data to answer the two main questions we asked at the beginning.

1. Which events are most harmful to population health?

Summarize the data

First, let us sum the fatalities and injuries for each event type.

fatalities_data <- cleaned_data[, .(FATALITIES = sum(FATALITIES)), by = EVTYPE]
injuries_data <- cleaned_data[, .(INJURIES = sum(INJURIES)), by = EVTYPE]

Next, we order the data by the number of fatalities and injuries to identify the most harmful events.

fatalities_data <- fatalities_data[order(-FATALITIES)]
injuries_data <- injuries_data[order(-INJURIES)]
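
The two summaries could also be computed in a single grouped pass and then ordered separately. A minimal sketch on toy data with made-up values; the names toy, health, by_fatalities, and by_injuries are illustrative.

```r
library(data.table)

# Toy rows standing in for cleaned_data (values are made up)
toy <- data.table(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 3, 4),
  INJURIES   = c(10, 0, 5, 1)
)

# One grouped pass produces both totals per event type
health <- toy[, .(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)), by = EVTYPE]

# Order the same summary table by each measure, largest first
by_fatalities <- health[order(-FATALITIES)]
by_injuries   <- health[order(-INJURIES)]
```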

Aggregate the bottom events as “Others” for easier analysis

Since we want to focus on the events that cause the most harm, we will only look at the top 4 events and aggregate the rest as “Others”.

top_fatalities_data <- fatalities_data[1:4]

# Sum the rest as "Others"
others_fatalities <- sum(fatalities_data[5:.N, FATALITIES])
top_fatalities_data <- rbind(top_fatalities_data, data.table(EVTYPE = "Others", FATALITIES = others_fatalities))

# Same for injuries
top_injuries_data <- injuries_data[1:4]
others_injuries <- sum(injuries_data[5:.N, INJURIES])
top_injuries_data <- rbind(top_injuries_data, data.table(EVTYPE = "Others", INJURIES = others_injuries))
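
Because the same "top N plus Others" step recurs for every measure in this report, it could be wrapped in a small helper. The sketch below is one possible version; the function name top_with_others is hypothetical, and it assumes a two-column table whose label column is called EVTYPE.

```r
library(data.table)

# Hypothetical helper: keep the n largest rows and sum the rest into "Others".
# Assumes dt has exactly two columns: EVTYPE and the column named in value_col.
top_with_others <- function(dt, value_col, n = 4) {
  # Sort a copy in descending order so the input table is left untouched
  ordered <- setorderv(copy(dt), value_col, order = -1L)
  # Nothing to aggregate when there are n rows or fewer
  if (nrow(ordered) <= n) return(ordered)
  top <- ordered[seq_len(n)]
  others <- data.table(EVTYPE = "Others", value = sum(ordered[[value_col]][-seq_len(n)]))
  setnames(others, "value", value_col)
  rbind(top, others)
}

# Toy summary table (values are made up)
toy <- data.table(
  EVTYPE     = c("A", "B", "C", "D", "E", "F"),
  FATALITIES = c(10, 50, 5, 30, 20, 1)
)
res <- top_with_others(toy, "FATALITIES", n = 4)
```

The early return also guards against tables with fewer than n + 1 rows, where an index such as 5:.N would count backwards.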

Plot Data

Let us now plot the data to see which events are the most severe in terms of fatalities and injuries.
We have decided to use pie charts for this purpose, as we want to show the distribution of harm across different event types.

par(mfrow = c(1, 2))

# Plot pie chart for fatalities with percentages and reduced font size
fatalities_pct <- round(top_fatalities_data$FATALITIES / sum(top_fatalities_data$FATALITIES) * 100, 1)
pie(top_fatalities_data$FATALITIES, labels = paste(top_fatalities_data$EVTYPE, fatalities_pct, "%"), main = "Fatalities by Event Type", cex = 0.5)

# Plot pie chart for injuries with percentages and reduced font size
injuries_pct <- round(top_injuries_data$INJURIES / sum(top_injuries_data$INJURIES) * 100, 1)
pie(top_injuries_data$INJURIES, labels = paste(top_injuries_data$EVTYPE, injuries_pct, "%"), main = "Injuries by Event Type", cex = 0.5)

Looking at the pie charts, Tornadoes are clearly the most harmful event type in terms of both fatalities and injuries, accounting for 37.2% of total fatalities and 65% of total injuries.
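
The percentage labels passed to pie() can also be built with sprintf, which gives a fixed number of decimals and avoids the space that paste leaves before the percent sign. A small sketch with made-up values; the helper name pct_labels is hypothetical.

```r
# Hypothetical helper: build "NAME (xx.x%)" labels from names and raw values
pct_labels <- function(names, values) {
  pct <- values / sum(values) * 100
  sprintf("%s (%.1f%%)", names, pct)
}

labels <- pct_labels(c("TORNADO", "FLOOD", "Others"), c(50, 30, 20))
```

The resulting character vector can be supplied as the labels argument of pie() in place of the paste() call.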

2. Which events have the greatest economic consequences?

Summarize the data

Similar to the previous analysis, we will summarize the data for property and crop damage by event type and order them according to their economic impact.

property_data <- cleaned_data[, .(PROPDMG = sum(PROPDMG)), by = EVTYPE]
crop_data <- cleaned_data[, .(CROPDMG = sum(CROPDMG)), by = EVTYPE]
property_data <- property_data[order(-PROPDMG)]
crop_data <- crop_data[order(-CROPDMG)]

Aggregate the bottom events as “Others” for easier analysis

Again, we shall only focus on the top 4 events that cause the most economic damage and aggregate the rest as “Others”.

top_property_data <- property_data[1:4]
other_property <- sum(property_data[5:.N, PROPDMG])
top_property_data <- rbind(top_property_data, data.table(EVTYPE = "Others", PROPDMG = other_property))

top_crop_data <- crop_data[1:4]
other_crop <- sum(crop_data[5:.N, CROPDMG])
top_crop_data <- rbind(top_crop_data, data.table(EVTYPE = "Others", CROPDMG = other_crop))

Plot Data

Let us see which events have the greatest economic consequences by plotting the property and crop damage data.

par(mfrow = c(1, 2))

# Plot pie chart for property damage with percentages and reduced font size
property_pct <- round(top_property_data$PROPDMG / sum(top_property_data$PROPDMG) * 100, 1)
pie(top_property_data$PROPDMG, labels = paste(top_property_data$EVTYPE, property_pct, "%"), main = "Property Damage by Event Type", cex = 0.5)

# Plot pie chart for crop damage with percentages and reduced font size
crop_pct <- round(top_crop_data$CROPDMG / sum(top_crop_data$CROPDMG) * 100, 1)
pie(top_crop_data$CROPDMG, labels = paste(top_crop_data$EVTYPE, crop_pct, "%"), main = "Crop Damage by Event Type", cex = 0.5)

Looking at the data, Floods are the most harmful in terms of property damage, accounting for 33.9% of total property damage. Drought is the most harmful in terms of crop damage, accounting for 28.5% of total crop damage.

Combining Property and Crop Data

While the above analysis is useful, it does not provide the full picture. There are still some questions worth answering.

  • Which causes a bigger economic impact, property damage or crop damage? Or are both of a similar scale?
  • Could there be another event that has a bigger harmful impact on the overall economy?

These open questions suggest that our second question has not been fully answered yet.
Since both PROPDMG and CROPDMG measure the economic impact in terms of monetary value, we can combine them to get a better understanding of the overall economic impact of each event.

damage_data <- merge(property_data, crop_data, by = "EVTYPE")
damage_data[, DAMAGE := PROPDMG + CROPDMG]
damage_data <- damage_data[order(-DAMAGE)]
damage_data <- damage_data[, .(EVTYPE, DAMAGE)]
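
The combined total could also be computed directly from the row-level data without a merge, and the overall property and crop totals can be compared in the same way. A minimal sketch on toy data with made-up values; the names toy, damage, and totals are illustrative.

```r
library(data.table)

# Toy rows standing in for cleaned_data after the multiplier adjustment
toy <- data.table(
  EVTYPE  = c("FLOOD", "DROUGHT", "FLOOD", "HAIL"),
  PROPDMG = c(5e6, 1e5, 2e6, 3e5),
  CROPDMG = c(1e6, 4e6, 0, 2e5)
)

# Total damage per event type in one grouped pass, largest first
damage <- toy[, .(DAMAGE = sum(PROPDMG + CROPDMG)), by = EVTYPE][order(-DAMAGE)]

# Overall scale of property damage versus crop damage
totals <- toy[, .(PROPERTY = sum(PROPDMG), CROP = sum(CROPDMG))]
```

Comparing the two columns of totals gives a rough answer to whether property and crop damage are of a similar scale.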

This time, let us focus on the top 7 events instead of the top 4 to get a better understanding of the distribution.

top_damage_data <- damage_data[1:7]
other_damage <- sum(damage_data[8:.N, DAMAGE])
top_damage_data <- rbind(top_damage_data, data.table(EVTYPE = "Others", DAMAGE = other_damage))

Plot Data

Let us plot the combined data to see the results.

damage_pct <- round(top_damage_data$DAMAGE / sum(top_damage_data$DAMAGE) * 100, 1)
pie(top_damage_data$DAMAGE, labels = paste(top_damage_data$EVTYPE, damage_pct, "%"), main = "Total Damage by Event Type", cex = 0.5)

Looking at the pie chart, it is clear that Floods are the most harmful, accounting for 31.6% of the total economic impact.

Results

From our analysis, we can conclude that Tornadoes are the most harmful event in terms of population health and Floods are the most harmful in terms of economic consequences.
So, policymakers and emergency responders should focus on preparing for and responding to Tornadoes and Floods to minimize the impact on public safety and economic stability.