A Brief Analysis of the NOAA Storm Database: Casualties and Economic Losses

Fernando Rodriguez
1/13/2018

Synopsis

This work outlines the data processing steps and subsequent analysis used to determine personal casualties (defined as injuries and deaths) and the economic impact of severe weather events in the US. The source data is the NOAA storm database, which in this instance covers events from 1950 through November 2011.

The entries in the NOAA database do not describe weather events consistently, so similar descriptions were grouped together into larger classifications (e.g. heat, hot, high temperatures) that better reflect the overall trend. Without this grouping, such trends would be split across many differently worded entries and could lead us to incorrect conclusions.

The analysis is written to guide the reader through the data sources, processing steps, and analysis. This allows the reader to understand the reasons behind our conclusions and invites them to perform a similar analysis and arrive at similar conclusions.

Data Processing

This section describes the steps taken to process the data for later analysis.

Libraries

This analysis uses the dplyr, lubridate, and ggplot2 packages to deal easily with data, dates, and plots, respectively.
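The chunk that loads these packages is not shown in this report. A minimal sketch of how they could be attached is given here, assuming all three are installed from CRAN; the variable names pkgs and available are introduced only for this illustration.

```r
# Packages named in the text; assumed to be installed from CRAN
pkgs <- c("dplyr", "lubridate", "ggplot2")

# Check availability first so that a missing package gives a clear message
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (!all(available)) {
    message("Missing packages: ", paste(pkgs[!available], collapse = ", "))
}

# Attach the packages that are available
for (pkg in pkgs[available]) {
    library(pkg, character.only = TRUE)
}
```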

Downloading and Loading Files into R

The code below downloads the file from the given URL and writes it to the destination file. The first if statement checks whether the file is already in the working directory (it is very large) and only proceeds to download it if it is missing. The second if statement checks whether the storm.data variable is already in the environment; if it is, the file is not read in again (again, because of its size).

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "stormdata.csv.bz2"

# Checks if file exists in the directory, if false, proceeds to download it
if(!file.exists(destfile)) {
    download.file(url = url, 
                  destfile = destfile, 
                  method = "curl")
}

if(!("storm.data" %in% ls())){
    storm.data <- read.csv(destfile, 
                           header = TRUE)
}

# Here the data frame read above is converted to a tibble
# (as_tibble() replaces tbl_df(), which is deprecated in current versions of dplyr)
storm.data <- as_tibble(storm.data)

In this step, we look at the EVTYPE column, which describes the weather event. As consistency is not the norm in this column, the code looks for keywords in it and creates a new column, EVTYPE2, where a summary entry is made for each record. For example, if we find the words "heat," "warm," or "hot," we write "Heat" in the new column, as these entries are all related to heat events. Because the assignments run in sequence, a record that matches several keyword groups ends up with the label of the last group it matches.

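# In future iterations, write a function that eliminates the brute force cleanup below

# Start with an empty (NA) summary column so the indexed assignments below have a column to fill
storm.data$EVTYPE2 <- NA_character_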

# Looking for heat-related keywords
heat.log <- grepl("heat|warm|high temperature|hot", 
                  storm.data$EVTYPE, 
                  ignore.case = TRUE)
storm.data$EVTYPE2[heat.log] <- "Heat"

# Looking for thunderstorm-related keywords
thunderstorm.log <- grepl("tstm|thunderstorm", 
                          storm.data$EVTYPE, 
                          ignore.case = TRUE)
storm.data$EVTYPE2[thunderstorm.log] <- "Thunderstorm"

# Looking for flood-related keywords
flood.log <- grepl("flood", 
                   storm.data$EVTYPE, 
                   ignore.case = TRUE)
storm.data$EVTYPE2[flood.log] <- "Flood"

# Looking for cold-related keywords
cold.log <- grepl("cold|chill|low|ice|winter|freez", 
                  storm.data$EVTYPE, 
                  ignore.case = TRUE)
storm.data$EVTYPE2[cold.log] <- "Cold"

# Looking for hail-related keywords
hail.log <- grepl("hail", 
                  storm.data$EVTYPE, 
                  ignore.case = TRUE)
storm.data$EVTYPE2[hail.log] <- "Hail"

# Looking for tornado-related keywords
tornado.log <- grepl("tornado", 
                     storm.data$EVTYPE, 
                     ignore.case = TRUE)
storm.data$EVTYPE2[tornado.log] <- "Tornado"

# Looking for hurricane-related keywords
hurricane.log <- grepl("hurricane", 
                       storm.data$EVTYPE, 
                       ignore.case = TRUE)
storm.data$EVTYPE2[hurricane.log] <- "Hurricane"

# Looking for tropical storm-related keywords
tropical.log <- grepl("tropical", 
                      storm.data$EVTYPE, 
                      ignore.case = TRUE)
storm.data$EVTYPE2[tropical.log] <- "Tropical Storm"

# All other items are classified as "Other" if there is no entry in EVTYPE2
other.log <- is.na(storm.data$EVTYPE2)
storm.data$EVTYPE2[other.log] <- "Other"
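The function mentioned in the comment at the top of this cleanup could look roughly like the sketch below. It is not part of the original analysis: the names evtype.patterns, classify.evtype, and sample.events are hypothetical, and the sample values were chosen only to illustrate the behavior. The patterns and their order are copied from the code above, so later patterns still overwrite earlier ones.

```r
# Hypothetical helper (not part of the original analysis): classify event
# descriptions with an ordered set of keyword patterns. Later patterns
# overwrite earlier ones, mirroring the order of the assignments above.
evtype.patterns <- c(
    "Heat"           = "heat|warm|high temperature|hot",
    "Thunderstorm"   = "tstm|thunderstorm",
    "Flood"          = "flood",
    "Cold"           = "cold|chill|low|ice|winter|freez",
    "Hail"           = "hail",
    "Tornado"        = "tornado",
    "Hurricane"      = "hurricane",
    "Tropical Storm" = "tropical"
)

classify.evtype <- function(evtype, patterns = evtype.patterns) {
    # Everything starts as "Other" and is relabeled on each keyword match
    result <- rep("Other", length(evtype))
    for (label in names(patterns)) {
        hit <- grepl(patterns[[label]], evtype, ignore.case = TRUE)
        result[hit] <- label
    }
    result
}

# Illustrative event descriptions
sample.events <- c("EXCESSIVE HEAT", "TSTM WIND", "FLASH FLOOD",
                   "THUNDERSTORM WINDS/HAIL", "ASTRONOMICAL LOW TIDE",
                   "WATERSPOUT")
classify.evtype(sample.events)
# "Heat" "Thunderstorm" "Flood" "Hail" "Cold" "Other"
```

Two effects of the keyword approach are visible here: "THUNDERSTORM WINDS/HAIL" ends up as "Hail" because the hail pattern is applied after the thunderstorm pattern, and "ASTRONOMICAL LOW TIDE" is labeled "Cold" because the keyword "low" matches it.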

Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Answer 1

We begin our analysis by grouping the dataset into the general weather types we created above and then summing the fatalities and injuries for each type. The casualties are then calculated as the sum of the injuries and fatalities.

# Grouping by general weather type, then finding the total casualties by adding total fatalities and total
# injuries, then the results are sorted in descending order by number of casualties

storm.evtype <- storm.data %>%
    group_by(EVTYPE2) %>%
    summarize(fatalities = sum(FATALITIES), 
              injuries = sum(INJURIES), 
              casualties = fatalities + injuries) %>%
    arrange(desc(casualties)) 

# We then proceed to create a plot of the casualties sorted in descending order
plot.evtype <- ggplot(data = storm.evtype, 
                      aes(x = reorder(EVTYPE2, casualties), 
                          y = casualties)
                      )

plot.evtype + geom_bar(stat = "identity", 
                       alpha = 2/3, 
                       fill = "navy") + 
    coord_flip() +
    xlab("Weather Event") +
    ylab("Number of Casualties (Injuries and Fatalities)") +
    ggtitle("NOAA Storm Database:  Casualties by Weather Event Type",
            subtitle = "Data from 1950 to 2011")

From the chart above, we see that tornadoes are the most harmful weather events with respect to population health in this dataset.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Answer 2

We begin our analysis by reducing the size of the dataset to be considered. The filter below keeps only records in which both the property damage and the crop damage are non-zero, which means that records reporting only one kind of damage are dropped as well. We then examine the damage exponents that remain and assign numerical multipliers to them (e.g. "B" is billions, "K" is thousands, etc.).

We then multiply the damages by these multipliers, group the records into the general weather types we created above, and add the property and crop damages to find the total damages for each type. Finally, we create a plot to illustrate these damages in descending order.

# Keeping only records where both the property damage and the crop damage are non-zero
# (records that report only one kind of damage are dropped as well), then changing the
# damage exponents for property and crops to all caps to make them easier to process later
storm.econ.clean <- storm.data %>% 
    filter(PROPDMG != 0 & CROPDMG != 0) %>%
    mutate(PROPDMGEXP = toupper(PROPDMGEXP), CROPDMGEXP = toupper(CROPDMGEXP)) 
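The filter above combines its two conditions with &, so it keeps only records where both damage columns are non-zero. A small made-up data frame (the values are illustrative and not taken from the NOAA data) shows how that differs from |, which would keep a record when either column is non-zero. The names toy, both.nonzero, and any.nonzero are introduced only for this sketch.

```r
# Made-up damage values, for illustration only
toy <- data.frame(PROPDMG = c(0, 25, 0, 10),
                  CROPDMG = c(0,  0, 5,  2))

# Same condition as the filter above: both columns must be non-zero
both.nonzero <- toy[toy$PROPDMG != 0 & toy$CROPDMG != 0, ]

# Alternative condition: at least one column is non-zero
any.nonzero <- toy[toy$PROPDMG != 0 | toy$CROPDMG != 0, ]

nrow(both.nonzero)  # 1
nrow(any.nonzero)   # 3
```

Switching the analysis to the second condition would bring in more exponent codes, so the multiplier lookup that follows would have to be extended to cover them.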


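# Find the unique values of the exponents, which come in various forms, to assign them numerical values
unique.exps <- unique(c(storm.econ.clean$PROPDMGEXP, storm.econ.clean$CROPDMGEXP))
# The multipliers are matched by position: each value must line up with the exponent
# at the same position in unique.exps, so inspect unique.exps before relying on this vector
multiplier <- c(10^9, 10^6, 10^3, 10^5, 1, 1, 10^3)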

# Create a lookup dataframe with the exponents and the multipliers
unique.multipliers <- data.frame(Unique = unique.exps, Multiplier = multiplier)

# Merge the clean storm dataset with the crop and property damage exponents
storm.econ.clean.mult1 <-  merge(storm.econ.clean, unique.multipliers, by.x = "CROPDMGEXP", by.y = "Unique")
storm.econ.clean.mult2 <- merge(storm.econ.clean.mult1, unique.multipliers, by.x = "PROPDMGEXP", by.y = "Unique")
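The multiplier vector above is matched to the exponent codes by position, so it depends on the order in which unique() returns them. A minimal sketch of a lookup keyed by the code itself is shown below. The function name exp.to.multiplier and the treatment of "H", of single digits, of blank codes, and of unrecognized codes are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical lookup keyed by exponent code rather than by position.
# Assumed meanings: H = hundreds, K = thousands, M = millions, B = billions,
# a single digit d = 10^d, blank = 1. Anything else is left as NA.
exp.to.multiplier <- function(exp.code) {
    code <- toupper(trimws(as.character(exp.code)))
    letters.map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
    multiplier <- rep(NA_real_, length(code))

    # Letter codes are looked up by name
    is.letter <- code %in% names(letters.map)
    multiplier[is.letter] <- letters.map[code[is.letter]]

    # Single digits are read as powers of ten
    is.digit <- grepl("^[0-9]$", code)
    multiplier[is.digit] <- 10^as.numeric(code[is.digit])

    # Blank codes leave the damage value unchanged
    multiplier[which(code == "")] <- 1
    multiplier
}

exp.to.multiplier(c("B", "m", "K", "5", "0", "", "?"))
# 1e+09 1e+06 1e+03 1e+05 1e+00 1e+00 NA
```

Applied to the damage columns (for example PROPDMG * exp.to.multiplier(PROPDMGEXP)), a lookup like this could stand in for the two merges, although the resulting totals should then be checked against the ones reported here.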

# Group by weather event type, then calculate damages by multiplying the damage values by the numerical exponents,
# then finding the total damages by adding property damages and crop damages together.  The values are then sorted
# in descending order.
storm.damages <- storm.econ.clean.mult2 %>% 
    group_by(EVTYPE2) %>%
    mutate(crop.damages = CROPDMG * as.numeric(Multiplier.x), 
           prop.damages = PROPDMG * as.numeric(Multiplier.y), 
           tot.damages = crop.damages + prop.damages) %>% 
    summarize(crop.damages  = sum(crop.damages), 
              prop.damages = sum(prop.damages), 
              tot.damages = sum(crop.damages,prop.damages)) %>%
    arrange(desc(tot.damages))

# Values are plotted in descending order in ggplot2
plot.damages <- ggplot(data = storm.damages, aes(x = reorder(EVTYPE2, tot.damages), y = tot.damages))

plot.damages + geom_bar(stat = "identity",
                        alpha = 2/3,
                        fill = "red") + 
    coord_flip() +
    xlab("Weather Event") +
    ylab("Economic Damages in US Dollars (Crop and Property Damages)") + 
    ggtitle("NOAA Storm Database:  Economic Damages by Weather Event Type",
            subtitle = "Data from 1950 to 2011")

The plot shows that flooding is the largest source of economic damage in this dataset, followed by hurricanes.