Synopsis

This report is the result of Peer Assignment 2 of the “Reproducible Research” course from Johns Hopkins University on Coursera. The goal of the assignment was to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which types of weather events have the greatest impact on public health and on the economy.

Tornadoes were found to cause by far the most injuries and fatalities. Floods, hurricanes, tornadoes and storm surges are responsible for the most property damage, while droughts, floods and ice storms cause the most crop damage.

Data Processing

We start by loading the packages required for this project. The data from the NOAA database is then loaded into the data frame stormData. To save loading time, this report assumes that the compressed file repdata_data_StormData.csv.bz2 is already located in the working directory.

library(dplyr)
library(ggplot2)
library(gridExtra)
stormData <- read.csv("repdata_data_StormData.csv.bz2")
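
Note that read.csv can read a bzip2-compressed file directly, so the archive does not need to be unpacked first. The following is a minimal sketch of that behaviour on a toy file; the temporary file and its two rows are made up for illustration and are not part of the NOAA data.

```r
# Toy illustration (hypothetical data): write a small CSV through a bzip2
# connection, then read it back the same way the report reads the NOAA file.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), INJURIES = c(3, 0))
tmp <- tempfile(fileext = ".csv.bz2")

con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses transparently.
toyRead <- read.csv(tmp, stringsAsFactors = FALSE)
print(toyRead)
```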

We subset the data frame to the event type, injuries and fatalities, and property and crop damage (including their exponents). These are the only columns needed for the rest of the analysis.

stormData <- stormData %>% 
        select(EVTYPE, INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

The EVTYPE column contains many different names for events that appear to be the same. Looking at the 20 most frequent names (after converting them to lower case), we see potential to combine some of them into a single category. A good example is the various types of thunderstorm winds.

stormData$EVTYPE <- tolower(stormData$EVTYPE)
head(sort(table(stormData$EVTYPE), decreasing = TRUE), 20)
## 
##                     hail                tstm wind        thunderstorm wind 
##                   288661                   219942                    82564 
##                  tornado              flash flood                    flood 
##                    60652                    54277                    25327 
##       thunderstorm winds                high wind                lightning 
##                    20843                    20214                    15754 
##               heavy snow               heavy rain             winter storm 
##                    15708                    11742                    11433 
##           winter weather             funnel cloud         marine tstm wind 
##                     7045                     6844                     6175 
## marine thunderstorm wind               waterspout              strong wind 
##                     5812                     3796                     3569 
##     urban/sml stream fld                 wildfire 
##                     3392                     2761

I have decided to combine only the thunderstorm wind variants, as these events are spread over three frequently occurring event names and therefore noticeably change the overall total when combined.

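# Relabel every event name that starts with "thunderstorm ... wind" or "tstm ... wind"
isTstmWind <- grepl("^thunderstorm.*wind|^tstm.*wind", stormData$EVTYPE)
stormData$EVTYPE[isTstmWind] <- "thunderstorm wind"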
head(sort(table(stormData$EVTYPE), decreasing = TRUE), 20)
## 
##        thunderstorm wind                     hail                  tornado 
##                   324696                   288661                    60652 
##              flash flood                    flood                high wind 
##                    54277                    25327                    20214 
##                lightning               heavy snow               heavy rain 
##                    15754                    15708                    11742 
##             winter storm           winter weather             funnel cloud 
##                    11433                     7045                     6844 
##         marine tstm wind marine thunderstorm wind               waterspout 
##                     6175                     5812                     3796 
##              strong wind     urban/sml stream fld                 wildfire 
##                     3569                     3392                     2761 
##                 blizzard                  drought 
##                     2719                     2488

Note that marine thunderstorm winds remain a separate category, still split over two event names (“marine thunderstorm wind” and “marine tstm wind”). There is clearly room for further clean-up, but not to a degree that would seriously change the ordering of the most frequent events at the top of the list.
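
If further clean-up were wanted, the two marine labels could be merged in the same way as the thunderstorm winds. The sketch below does this on a small made-up vector rather than on stormData, so it does not alter any of the results reported here.

```r
# Hypothetical sample of lower-cased event names (not the real data).
ev <- c("marine tstm wind", "marine thunderstorm wind", "hail",
        "marine tstm wind", "thunderstorm wind")

# Anchor the pattern at "marine" so plain thunderstorm winds are untouched.
isMarine <- grepl("^marine (tstm|thunderstorm) wind", ev)
ev[isMarine] <- "marine thunderstorm wind"

print(table(ev))
```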

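The next processing step is to combine the damage values with their exponents to get the correct damage amounts. Letter codes are read as hundreds (h), thousands (k), millions (m) and billions (b); a digit is treated as a multiplier of 10, “+” as a multiplier of 1, and an empty, “-” or “?” exponent sets the amount to 0.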

# Convert a damage value and its exponent code into an absolute amount.
corrDamages <- function(dmg, exp) {
        if (exp %in% c("+"))
                return(dmg * 1)
        else if (exp %in% as.character(0:9))
                return(dmg * 10)
        else if (exp %in% c("h", "H"))
                return(dmg * 100)
        else if (exp %in% c("k", "K"))
                return(dmg * 1000)
        else if (exp %in% c("m", "M"))
                return(dmg * 1000000)
        else if (exp %in% c("b", "B"))
                return(dmg * 1000000000)
        else
                # "", "-", "?" and any unrecognised code: amount counted as 0
                return(0)
}

stormData$corrPropDmg <- mapply(corrDamages, stormData$PROPDMG, stormData$PROPDMGEXP)
stormData$corrCropDmg <- mapply(corrDamages, stormData$CROPDMG, stormData$CROPDMGEXP)
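
Calling corrDamages once per row through mapply can be slow on several hundred thousand rows. As a sketch of an alternative, the same mapping can be expressed as a named lookup vector and applied to whole columns at once. The names expMultiplier and corrDamagesVec are introduced here for illustration only; the multipliers are meant to mirror the rules in corrDamages, and the input values are made up.

```r
# Multipliers per exponent code, mirroring corrDamages (illustrative names).
expMultiplier <- c(
  "+" = 1,
  setNames(rep(10, 10), as.character(0:9)),
  h = 1e2, H = 1e2, k = 1e3, K = 1e3,
  m = 1e6, M = 1e6, b = 1e9, B = 1e9,
  "-" = 0, "?" = 0
)

corrDamagesVec <- function(dmg, exp) {
  exp <- as.character(exp)
  # match() returns NA for codes without a multiplier, including "".
  mult <- unname(expMultiplier[match(exp, names(expMultiplier))])
  mult[is.na(mult)] <- 0   # empty or unknown exponent: counted as 0
  dmg * mult
}

# Made-up values to show the behaviour.
toyDamage <- corrDamagesVec(c(25, 2.5, 1, 7, 3), c("K", "M", "B", "", "5"))
print(toyDamage)
```

On the real data this would be used as corrDamagesVec(stormData$PROPDMG, stormData$PROPDMGEXP), assuming the columns are as selected earlier in this report.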

Finally, we create data frames containing the total injuries, fatalities, property damage and crop damage per event type, each sorted in descending order of impact.

injuryDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
        summarize(injuries = sum(INJURIES)) %>% arrange(desc(injuries)))
fatalityDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
        summarize(fatalities = sum(FATALITIES)) %>% arrange(desc(fatalities)))
propDmgDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
        summarize(propDmg = sum(corrPropDmg)) %>% arrange(desc(propDmg)))
cropDmgDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
        summarize(cropDmg = sum(corrCropDmg)) %>% arrange(desc(cropDmg)))
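
All four summaries follow the same pattern: group by event type, sum one column, sort in descending order. A base R sketch of that pattern on made-up data, using a hypothetical helper totalByEvent, shows the shape of each resulting data frame. It is an illustration only, not the code used to produce the results in this report.

```r
# Hypothetical helper: total of one column per event type, largest first.
totalByEvent <- function(df, column) {
  totals <- vapply(split(df[[column]], df$EVTYPE), sum, numeric(1))
  totals <- sort(totals, decreasing = TRUE)
  out <- data.frame(EVTYPE = names(totals), total = unname(totals),
                    stringsAsFactors = FALSE)
  names(out)[2] <- column
  out
}

# Made-up rows, not taken from the NOAA data.
toy <- data.frame(
  EVTYPE   = c("tornado", "hail", "tornado", "flood", "hail"),
  INJURIES = c(10, 1, 5, 4, 0),
  stringsAsFactors = FALSE
)
toyInjuries <- totalByEvent(toy, "INJURIES")
print(toyInjuries)
```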

Results

Here we present tables and plots of the top 10 weather events for each type of consequence (injuries, fatalities, property damage and crop damage).

We start with tables and a plot of the top 10 events with regard to injuries and fatalities.

head(injuryDf, 10)
##               EVTYPE injuries
## 1            tornado    91346
## 2  thunderstorm wind     9469
## 3              flood     6789
## 4     excessive heat     6525
## 5          lightning     5230
## 6               heat     2100
## 7          ice storm     1975
## 8        flash flood     1777
## 9               hail     1361
## 10      winter storm     1321
head(fatalityDf, 10)
##               EVTYPE fatalities
## 1            tornado       5633
## 2     excessive heat       1903
## 3        flash flood        978
## 4               heat        937
## 5          lightning        816
## 6  thunderstorm wind        709
## 7              flood        470
## 8        rip current        368
## 9          high wind        248
## 10         avalanche        224
g1 <- ggplot(injuryDf[1:10,], aes(x = reorder(EVTYPE, desc(injuries)), y = injuries)) + 
        geom_col() + theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) + xlab("Event type") + ylab("Total number of injuries")
g2 <- ggplot(fatalityDf[1:10,], aes(x = reorder(EVTYPE, desc(fatalities)), y = fatalities)) + 
        geom_col() + theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) + xlab("Event type") + ylab("Total number of fatalities")
grid.arrange(g1, g2, nrow = 1, top = "Top 10 weather events ordered by caused number of injuries and fatalities")

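As we can see, tornadoes cause far more injuries and fatalities than any other event type: tornado injuries (91,346) are nearly ten times those of the runner-up, thunderstorm wind (9,469), and tornado fatalities (5,633) are about three times those of excessive heat (1,903).

Finally, we present tables and a plot of the top 10 events with regard to property and crop damage.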

head(propDmgDf, 10)
##               EVTYPE  propDmg
## 1              flood 1.45e+11
## 2  hurricane/typhoon 6.93e+10
## 3            tornado 5.69e+10
## 4        storm surge 4.33e+10
## 5        flash flood 1.61e+10
## 6               hail 1.57e+10
## 7          hurricane 1.19e+10
## 8  thunderstorm wind 9.75e+09
## 9     tropical storm 7.70e+09
## 10      winter storm 6.69e+09
head(cropDmgDf, 10)
##               EVTYPE  cropDmg
## 1            drought 1.40e+10
## 2              flood 5.66e+09
## 3        river flood 5.03e+09
## 4          ice storm 5.02e+09
## 5               hail 3.03e+09
## 6          hurricane 2.74e+09
## 7  hurricane/typhoon 2.61e+09
## 8        flash flood 1.42e+09
## 9       extreme cold 1.31e+09
## 10 thunderstorm wind 1.22e+09
g1 <- ggplot(propDmgDf[1:10,], aes(x = reorder(EVTYPE, desc(propDmg)), y = propDmg)) + 
        geom_col() + theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) + xlab("Event type") + ylab("Total sum of property damages")
g2 <- ggplot(cropDmgDf[1:10,], aes(x = reorder(EVTYPE, desc(cropDmg)), y = cropDmg)) + 
        geom_col() + theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) + xlab("Event type") + ylab("Total sum of crop damages")
grid.arrange(g1, g2, nrow = 1, top = "Top 10 weather events ordered by sum of property and crop damages")

Here it is clearly visible that floods, hurricanes, tornadoes and storm surges cause the most property damage, while droughts, floods and ice storms cause the most crop damage.
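
Both damage tables list “hurricane” and “hurricane/typhoon” as separate event types, so merging these labels in the way the thunderstorm winds were merged would change the hurricane totals. As a rough sketch using only the rounded figures printed in the top 10 tables (so the results are approximate, and the helper combineHurricane is purely illustrative), combining the two labels gives:

```r
# Rounded totals copied from the top 10 tables in this report.
propDmg <- c("flood" = 1.45e11, "hurricane/typhoon" = 6.93e10,
             "tornado" = 5.69e10, "storm surge" = 4.33e10,
             "hurricane" = 1.19e10)
cropDmg <- c("drought" = 1.40e10, "flood" = 5.66e09,
             "river flood" = 5.03e09, "hurricane" = 2.74e09,
             "hurricane/typhoon" = 2.61e09)

# Illustrative helper: sum every label that starts with "hurricane".
combineHurricane <- function(x) sum(x[grepl("^hurricane", names(x))])

hurricaneProp <- combineHurricane(propDmg)  # 6.93e10 + 1.19e10
hurricaneCrop <- combineHurricane(cropDmg)  # 2.74e09 + 2.61e09
print(c(property = hurricaneProp, crop = hurricaneCrop))
```

With these rounded inputs, the combined hurricane property damage is about 8.12e10, still second to floods, and the combined crop damage is about 5.35e09, which would fall between flood and river flood.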