An analysis of NOAA storm data events with respect to Population Health and Economic Impact

Neil Kutty

Synopsis

This project identifies the most damaging storm event types in the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database with respect to population health and economic consequences. The data file is downloaded from the course website and processed using dplyr. Before that processing, the PROPDMGEXP and CROPDMGEXP codes from the original dataset are converted into numeric multipliers, following the NOAA documentation, so that the economic impact can be computed.

The analysis finds that the event type with the greatest population health impact is Tornado and the event type with the greatest economic impact is Flood. The full results and analysis are provided below.

Data Processing

Load libraries for analysis.

library(dplyr)
library(tidyr)
library(ggplot2)

Downloading and reading in the data

We first check whether the file exists in the current working directory. If not, it is downloaded and saved as 2FStormData.csv.bz2.

The next step is to read the dataset into a data frame. Because the dataset is large, we keep only a subset of the available columns, primarily those related to event type, population health impact, and economic impact.

note: in the below chunk, cache is set to TRUE

# Download the file only if it is not already in the working directory
if(!file.exists('2FStormData.csv.bz2')){
    download.file(url = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
                  destfile = '2FStormData.csv.bz2')
}

# Read the data in either case (not only when the file already existed),
# keeping just the columns needed for the analysis
datLite <- read.csv('2FStormData.csv.bz2', stringsAsFactors = FALSE)[,c('BGN_DATE','BGN_TIME','TIME_ZONE','STATE',
                                                  'EVTYPE','LENGTH','WIDTH','MAG','F',
                                                  'FATALITIES','INJURIES','PROPDMG',
                                                  'PROPDMGEXP','CROPDMG','CROPDMGEXP',
                                                  'LATITUDE','LONGITUDE')]
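
As a small self-contained illustration of the read step, the sketch below writes a tiny made-up CSV to a bzip2-compressed temporary file and reads it back with read.csv, keeping only some of the columns. The file contents and the names tmp and toy are invented for the example; the only behavior relied on is that read.csv can read a compressed file directly.

```r
# Toy stand-in for the storm file (values are invented for illustration)
tmp <- tempfile(fileext = '.csv.bz2')
con <- bzfile(tmp, open = 'w')
writeLines(c('EVTYPE,FATALITIES,INJURIES,REMARKS',
             'TORNADO,1,5,first',
             'FLOOD,0,2,second'), con)
close(con)

# read.csv decompresses the bz2 file on the fly; we then keep a column subset
toy <- read.csv(tmp, stringsAsFactors = FALSE)[, c('EVTYPE','FATALITIES','INJURIES')]
print(toy)

# Remove the temporary file
unlink(tmp)
```

Subsetting after the read, as done here and in the chunk above, still parses every column first; it only reduces what is kept in memory afterwards.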

Deriving Multiplier Columns

The goal of the next steps is a data frame grouped by the EVTYPE variable with relevant counts and summary values.

Before that, as a cleanup step, we convert the codes in PROPDMGEXP and CROPDMGEXP into numeric multipliers. These multipliers are later applied to the PROPDMG and CROPDMG values respectively to obtain the adjusted damage amounts.

note: in the below chunk, cache is set to TRUE

#Create a function to calculate the multiplier columns to adjust PROPDMG and CROPDMG values

AdjExp <- function(x){
    if(x %in% c('','-','?','+','0'))
        return(1)
    else if(x %in% c('b','B'))
        return(1000000000)
    else if(x %in% c('h','H'))
        return(100)
    else if(x %in% c('k','K'))
        return(1000)
    else if(x %in% c('m','M'))
        return(1000000)
    else if(grepl('^[0-9]+$', x))   # digit codes are treated as powers of ten
        return(10^as.numeric(x))
    else                            # any unrecognized code leaves the value unchanged
        return(1)
}

datLite$PROPDMGADJ <- sapply(datLite$PROPDMGEXP, AdjExp)
datLite$CROPDMGADJ <- sapply(datLite$CROPDMGEXP, AdjExp)
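
Calling a function once per row with sapply can be slow on a table of this size. One possible alternative, shown here only as a sketch, is a vectorized lookup that applies the same mapping to a whole column at once. The names adj_exp_vec and letter_mult are hypothetical and are not part of the original analysis.

```r
# Hypothetical vectorized counterpart to AdjExp (names are illustrative)
adj_exp_vec <- function(x){
    x <- toupper(as.character(x))
    # letter codes and their multipliers, matching the AdjExp branches
    letter_mult <- c(H = 100, K = 1000, M = 1e6, B = 1e9)
    out <- rep(1, length(x))                       # default: multiplier of 1
    is_letter <- x %in% names(letter_mult)
    out[is_letter] <- letter_mult[x[is_letter]]
    is_digit <- grepl('^[0-9]+$', x)
    out[is_digit] <- 10^as.numeric(x[is_digit])    # digit codes as powers of ten
    out
}

# Example codes of the kind handled by AdjExp
print(adj_exp_vec(c('K', 'm', '', '2', '?', 'B')))
```

Under these assumptions it could be used as datLite$PROPDMGADJ <- adj_exp_vec(datLite$PROPDMGEXP), and likewise for CROPDMGEXP.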

Creating Summary Data Set

Next we create a data frame that groups by EVTYPE (event type) and summarizes the measures of population health and economic impact. In the summarize function, we multiply the original PROPDMG and CROPDMG values by the newly created multipliers PROPDMGADJ and CROPDMGADJ respectively, and sum the products per event type. We call the resulting values PropDamage and CropDamage in our summary Events table.

# create a dataset tabulating Event Types (EVTYPE) and relevant variables
Events <- datLite %>%
    select(EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG,PROPDMGADJ,CROPDMGADJ) %>%
    group_by(EVTYPE) %>%
    summarize(Count = n(),
              TotalFatalities = sum(FATALITIES),
              TotalInjuries = sum(INJURIES),
              TotalIncidents = TotalFatalities + TotalInjuries,
              PropDamage = sum(PROPDMG*PROPDMGADJ),
              CropDamage = sum(CROPDMG*CROPDMGADJ),
              TotalDamage = PropDamage + CropDamage) %>%
    arrange(desc(TotalFatalities))
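
To make the damage arithmetic concrete, here is a small worked example on invented rows: a recorded PROPDMG of 25 with a multiplier of 1000 contributes 25,000 to PropDamage. The data frame toy and its values are made up for illustration and do not come from the NOAA data.

```r
library(dplyr)

# Invented rows, for illustration only
toy <- data.frame(EVTYPE     = c('FLOOD', 'FLOOD', 'TORNADO'),
                  FATALITIES = c(0, 1, 3),
                  INJURIES   = c(2, 0, 10),
                  PROPDMG    = c(25, 1.5, 2),
                  PROPDMGADJ = c(1000, 1e6, 1e6),
                  CROPDMG    = c(10, 0, 0),
                  CROPDMGADJ = c(1000, 1, 1))

# Same pattern as the Events summary: multiply by the multiplier, then sum per group
toy_events <- toy %>%
    group_by(EVTYPE) %>%
    summarize(Count = n(),
              TotalIncidents = sum(FATALITIES) + sum(INJURIES),
              PropDamage = sum(PROPDMG * PROPDMGADJ),
              CropDamage = sum(CROPDMG * CROPDMGADJ),
              TotalDamage = PropDamage + CropDamage)

print(toy_events)
```

Note that TotalDamage refers to PropDamage and CropDamage, which are defined earlier in the same summarize call; dplyr evaluates the summaries in order, so this works.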

Results

Question 1

Across the United States, which types of events are most harmful with respect to population health?

To answer this question, we first subset the Events data frame we created in the data processing step above. Here, we only select the variables related to population health (those related to fatalities and injuries).

The next step is to keep only the event types in the top 25% by TotalIncidents, that is, those above the 75th percentile. We don’t need every event type and its summary data, so we take only this top slice.

Finally, we arrange the data frame in descending order of the TotalIncidents column, which sums the fatalities and injuries per event type.

PopHealth <- Events %>%
    select(EVTYPE,Count,TotalFatalities,
           TotalInjuries,TotalIncidents) %>%
    filter(TotalIncidents > quantile(TotalIncidents,.75)) %>%
    arrange(desc(TotalIncidents))
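
The effect of the quantile filter can be seen on a small invented vector. This is only a sketch: the totals below are made up, and the names incidents, cutoff and top are not part of the analysis.

```r
# Invented totals for eight hypothetical event types
incidents <- c(a = 0, b = 0, c = 1, d = 4, e = 10, f = 50, g = 200, h = 1000)

# 75th percentile of the totals (default quantile method)
cutoff <- quantile(incidents, .75)

# Keep only the values strictly above the cutoff, largest first
top <- sort(incidents[incidents > cutoff], decreasing = TRUE)

print(cutoff)
print(top)
```

Because the comparison is strict, an event type whose total equals the cutoff exactly is dropped rather than kept.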

Answer:

The Event Type with the greatest harm to population health is `TORNADO`

ggplot(data = PopHealth[1:5,], aes(x=reorder(EVTYPE,-TotalIncidents), y=TotalIncidents)) +
    geom_bar(stat="identity", alpha=0.75, color='blue', fill='blue') +
    xlab("Event Type")+
    ylab("Total Population Health Incidents")+
    ggtitle("Top Five Events by Total Fatalities and Injuries Caused")+
    geom_text(aes(label=TotalIncidents), size = 5.5, vjust = .15, color = "black") +
    theme(axis.title=element_text(size=10),
          axis.text.x = element_text(face = 'bold', size = 12, angle = 45, hjust = 1))

Question 2

Across the United States, which types of events have the greatest economic consequences?

To answer this question, we first subset the Events data frame we created in the data processing step above. Here, we only select the variables related to economic impact (those related to property damage and crop damage).