NOAA Storm Data Analysis

knitr::opts_chunk$set(echo = TRUE)

## Load libraries
library(dplyr)
library(ggplot2)
library(gridExtra)
library(sqldf)
library(lubridate)
library(stringr)
options(scipen=999)

Synopsis

Storms and other severe weather events can cause public health and economic problems for communities. In this project we analyze data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data in this database contains characteristics of major storms and weather events in the United States, as well as estimates of any fatalities, injuries, and property damage.

Excessive heat, tornado, and flash flood are found to be the events most harmful to the population health. However, data suggest this is mostly the results of the frequency of this type of events. A closer look reveals than Hurricanes (Typhoon) and Tsunamis are the most harmful based on the number of fatalities and injuries per event.

Flood, Hurricane (Typhoon), and Storm Surge/Tide are events with the higher economic impact to communities, both in total, and in average.

This analysis corresponds to Coursera Reproducible Research Course.

Data Processing

First, we download the NOAA Storm Data and save it into a data frame noaaData. A look into the Data shows plenty of typos in the names of the events recorded. While there should be 48 official types of events, a look at the number of unique events in the database shows 985.

Second, looking at the number of different events recorded per year, most events started to be recorded in 1996. Therefore I decided to take the next two steps to clean the data:

Subset the data to only those records starting in 1996.
Create a table with the official names of the types of events. I will use this to ensure all types of events are included in my final data subset.

## Download dataset
destfile = "noaaStormData.bz2"
if (exists("destfile")==FALSE) {
        fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileURL, destfile = "noaaStormData.bz2")
        downloadDate <- date()
}

## Read .csv file
noaaData <- read.csv("./noaaStormData.bz2", header = TRUE)

## Find number of events recorded per year
## Change BGN_DATE variable to date format
noaaData$BGN_DATE <- as.Date(noaaData$BGN_DATE, format = "%m/%d/%Y")
## Create new variable with year of event
noaaData <- cbind(noaaData, eventYear=year(noaaData$BGN_DATE))

## Find the number of events recorded per year
eventsPerYear <- noaaData %>% group_by(eventYear) %>% count()

## Create graph of number of events per year
p1 <- ggplot(eventsPerYear, aes(x=eventYear, y=n)) +
        # Add title and labels
        ggtitle("Number of Events Recorded per Year") +
        # Add lines and points
        geom_line(color="darkblue") + geom_point()
        
p1 + scale_x_continuous(name="Year", breaks = seq(1940,2020,5)) +
        scale_y_continuous(name="Number of Events", breaks = seq(0, 65000, 5000))

### Create table with official event names
officialEventNames <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
                        "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
                        "Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
                        "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
                        "Funnel Cloud", "Freezing Fog", "Hail", "Heat",
                        "Heavy Rain", "Heavy Snow", "High Surf", "High Wind",
                        "hurricane (typhoon)", "ice storm", "lake-effect snow", "lakeshore flood",
                        "lightning", "marine hail", "marine high wind", "marine strong wind", 
                        "marine thunderstorm wind", "rip current", "seiche", "sleet",
                        "stormsurge/tide", "strong wind", "thunderstorm wind", "tornado",
                        "tropical depression", "tropical storm", "tsunami", "volcanic ash",
                        "waterspout", "wildfire", "winter storm", "winter weather")
officialEventNames <- tolower(officialEventNames)

#as.data.frame(officialEventNames)

## Subset the data into a new dataset to keep only records starting in 1996
## All events started to be recorded this year.
## Also, start a new variable eventType as a copy of EVTYPE. We will be making 
## corrections to this variable.
dataSubset <- noaaData %>% 
        filter(eventYear >= 1996) %>%
        select(EVTYPE, eventYear, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
        mutate(eventType=as.character(EVTYPE))

#head(dataSubset, 3)

Once I created a subset of the NOAA storm data, I looked at the frequency of the different types of evens recorded. After computing cumulative densities, it is noted that 99% of the data is contained in the first 50 variables.

## Find most common events
eventFreq <- dataSubset %>% 
        select(EVTYPE) %>% 
        group_by(EVTYPE) %>% 
        count() %>% 
        arrange(-n)
## Add cumulative and density values to table above. This will help to filter events.
eventFreq <- cbind(Num=seq(1,dim(eventFreq)[1],1), eventFreq)
eventFreq <- cbind(eventFreq, cumsum=cumsum(eventFreq$n)) # Calculate cumulative frequencies
eventFreq <- cbind(eventFreq, Density=eventFreq$n/sum(eventFreq$n)) # Calculate densities
eventFreq <- cbind(eventFreq, cumDensity=cumsum(eventFreq$Density)) # Calculate cumulative densities

#head(eventFreq)

Applying Corrections

I noted that many variable names were not only typos, but many events were incorrectly logged, sometimes with older names. After looking at the NOAA Storm Data documentation, I created the table below to make corrections to some variables found among the most frequent events. I apply the corrections to the eventType variable.

## Corrections after visual inspection:
corrections <- c("Wind","High Wind",
                 "TSTM Wind", "Marine Thunderstorm Wind",
                 "Thunderstorm Winds","Thunderstorm Wind",
                 "Marine TSTM Wind","Marine Thunderstorm Wind",
                 "Urban/SML Stream Fld","Heavy Rain",
                 "High Winds","High Wind",
                 "Wild/Forest Fire","Wildfire",
                 "Winter Weather/Mix","Winter Weather",
                 "TSTM Wind/Hail","Marine Thunderstorm Wind",
                 "Flash Flooding","Flash Flood",
                 "Extreme Cold","Extreme Cold/Wind Chill",
                 "Flood/Flash Flood","Flood",
                 "Landslide","Debris Flow",
                 "Snow","Heavy Snow",
                 "Fog","Dense Fog",
                 "Rip Currents","Rip Current",
                 "Storm Surge", "Storm Surge/Tide",
                 "Astronomical High Tide", "High Surf",
                 "small hail", "hail",
                 "gusty winds", "strong wind",
                 "freeze", "frost/freeze",
                 "wintry mix", "winter weather",
                 "hurricane/typhoon", "hurricane (typhoon)",
                 "moderate snowfall", "winter weather",
                 "river flood", "flood",
                 "unseasonably warm", "heat",
                 "record warmth", "heat",
                 "coastal flooding", "coastal flood",
                 "dry microburst", "thunderstorm wind",
                 "light snow", "winter storm",
                 "freezing rain", "winter weather",
                 "strong winds","strong wind",
                 "extreme windchill","extreme cold/wind chill",
                 "heavy surf/high surf", "high surf")
corrections <- t(matrix(corrections,nrow=2))
corrections <- data.frame(old=corrections[,1], new=corrections[,2])


## Apply corrections from table above
for (i in 1:dim(corrections)[1]) {
        NameMatches <- which(tolower(dataSubset$EVTYPE)==tolower(corrections[i,"old"]))
        dataSubset[NameMatches,"eventType"] <- as.character(corrections[i,"new"])
}


##Extra Individual Corrections after second screening
dataSubset$eventType[grep("tstm", dataSubset$eventType)] <- "marine thunderstorm wind"
dataSubset$eventType[grep("extreme ", dataSubset$eventType)] <- "extreme cold/wind chill"
dataSubset$eventType[grep("blizz", dataSubset$eventType)] <- "blizzard"
dataSubset$eventType[grep("hurricane", dataSubset$eventType)] <- "hurricane (typhoon)"


## Make all events lower case (for consistency) and create factor with the new variable "eventType"
dataSubset$eventType <- tolower(dataSubset$eventType)
dataSubset$eventType <- factor(dataSubset$eventType)
#levels(dataSubset$eventType)

Once corrections have been made, I create a second table with events frequencies, as well as the cumulative densities. After the corrections, 99.7% of the data is concentrated in the top 40 types events. I create a final subset where I make sure the top 40 most frequent events are included, as well as official events as recorded in the officialEventNames table.

Also, I simplify the table by keeping only the variables that will be needed for the analysis:

eventType
Fatalities
Injuries
Property Damage (adjusted to show amounts in dollars)
Crop Damage (adjusted to show amoung in dollars)

eventFreq2 <- dataSubset %>% 
        select(eventType, FATALITIES, INJURIES) %>% 
        group_by(eventType) %>% 
        summarise(n=n(), 
                  fatalities_total=sum(FATALITIES), 
                  injuries_total=sum(INJURIES)) %>% 
        arrange(-n)
eventFreq2 <- cbind(Num=seq(1,dim(eventFreq2)[1],1), eventFreq2)
eventFreq2 <- cbind(eventFreq2, cumsum=cumsum(eventFreq2$n)) # Calculate cumulative frequencies
eventFreq2 <- cbind(eventFreq2, Density=eventFreq2$n/sum(eventFreq2$n)) # Calculate densities
eventFreq2 <- cbind(eventFreq2, cumDensity=cumsum(eventFreq2$Density)) # Calculate cumulative densities


## Create a list of the most common events. I decided to focus on those where we
## concentrate 99.7% of the data
topList <- subset(eventFreq2, cumDensity<=.997)$eventType


## Subset again the data to keep only those events that are part of the most common events AND
## also all those events that are in the officialEventsNames table, that way we ensure all official
## events have been considered
finalSubset <- dataSubset[which(dataSubset$eventType %in% topList | 
                                        dataSubset$eventType %in% officialEventNames),]
# Drop unused levels
finalSubset$eventType <- droplevels(finalSubset$eventType)
finalSubset$EVTYPE <- droplevels(finalSubset$eventType)


##-------------------------------------------------------------------------------------------------------------
## Create healthTotal variable, which is the sum of fatalities plus injuries
## Create propertyDamage variable, which is the product of PROPDMG times the value of PROPDMGEXP (K=10^3, M=10^6, B=10^9)
## Create cropDamage variable, which is the product of CROPDMG times the value of CROPDMGEXP (K=10^3, M=10^6, B=10^9)
finalSubset <- finalSubset %>% 
        mutate(total=FATALITIES+INJURIES,
               propertyDamage=PROPDMG*ifelse(finalSubset$PROPDMGEXP=="K",10^3,
                                             ifelse(finalSubset$PROPDMGEXP=="M",10^6, 10^9)),
               cropDamage=CROPDMG*ifelse(finalSubset$CROPDMGEXP=="K",10^3,
                                         ifelse(finalSubset$CROPDMGEXP=="M",10^6, 10^9)))
## Keep only the necessary variables (eventType, year, fatalities, injuries, total, propertyDamage, cropDamage)
finalSubset <- finalSubset %>%
        select(eventType, eventYear, FATALITIES, INJURIES, total, propertyDamage, cropDamage)

With a final subset table created, I create two more tables, each intended to be used to answer the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

##-------------------------------------------------------------------------------------------------------------
## Create table with population health
## Group by eventType, and sort in descending order by number of fatalities and injuries
populationHealth <- finalSubset %>%
        group_by(eventType) %>% 
        summarise(freq=n(),
                  fatalities=sum(FATALITIES),
                  fatalitiesAvg=mean(FATALITIES),
                  fatalitiesStd=sd(FATALITIES),
                  injuries=sum(INJURIES),
                  injuriesAvg=mean(INJURIES),
                  injuriesStd=sd(INJURIES),
                  all=sum(total),
                  allAvg=mean(total),
                  allStd=sd(total)) %>%
        arrange(-fatalities, -injuries)

##-------------------------------------------------------------------------------------------------------------
## Create table with economic impact
## Group by eventType, and sort in descending order by number of fatalities and injuries
economicImpact <- finalSubset %>%
        group_by(eventType) %>% 
        summarise(freq=n(),
                  propertyTotal=sum(propertyDamage)/1000000,
                  propdmgAvg=mean(propertyDamage)/1000000,
                  cropTotal=sum(cropDamage)/1000000,
                  cropdmgAvg=mean(cropDamage)/1000000) %>%
        arrange(-propertyTotal)

Results

The figures below show that excessive heat, tornado, and flash flood are found to be the events most harmful to the population health. Data suggest this is due to the frequency of this type of events. Hurricanes (Typhoon) and Tsunamis are the most harmful based on the number of fatalities and injuries per event.