Synopsis

In this document we will investigate the effects that major weather events have across the United States. In particular, we will examine the effects on the population (measured in number of injuries and fatalities per event) and the economy (measured as the damage done to property and crops, in dollars). We will group the data into one of 48 different weather event categories, as defined by NOAA, and then summarise the data.

Data Processing

We begin by downloading the data and reading it into a table. The download.file

setwd('~/Coursera/DS Repo/RepResearch/Assignment2/')
## Download is commented out so we don't repeatedly download
## download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
##              'StormData.csv.bz2')

## read.csv can handle the compressed file so we just read it in
newData <- read.csv('StormData.csv.bz2')

For our analysis we are only interested in the data on event type, injuries, fatalities, and the amount of damage done, so we filter out the other columns. The dplyr package makes this easy, and will be useful later on.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
newData <- select(newData, EVTYPE, FATALITIES, INJURIES, 
                    PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Let’s take a quick look at the data:

str(newData)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

The event type variable has 985 categories. What a mess. NOAA lists 48 different categories the weather events should fall into, so we need to do some fixing to make the data similar to what we expect it to be.

We start by finding out how many entries each of these 985 factors has in our dataset. The count function in plyr makes this easy, then we remove plyr so that it does not interact with dplyr later on.

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
evCounts <- arrange(count(newData$EVTYPE), desc(freq))
detach(package:plyr)
evCounts[c(1,50),]
##              x   freq
## 1         HAIL 288661
## 50 STORM SURGE    261
percent <- summarise(evCounts[1:50,],sum(freq))/nrow(newData)

Let’s look at just the top 50 categories. Note that 99.1283358 percent of all of our observations are contained in these 50. Also note that the difference between the first and 50th value is three orders of magnitude. We are interested in finding the event with the greatest total effect, so it is highly unlikely that including the remaining 935 event types will change our results. To simplify our data set, we will use only these top 50 categories. Some of these category names can be combined (combining plurals into one single category, for example) so we will use gsub to combine the similar categories. The categories created are not verbatim the ones proposed by NOAA, but they are representative of those categories.

evCounts <- evCounts[1:50,]
newData <- filter(newData, EVTYPE %in% evCounts$x)
newData$EVTYPE <- gsub('TSTM', 'THUNDERSTORM', newData$EVTYPE)
newData$EVTYPE <- gsub('WINDS', 'WIND', newData$EVTYPE)
newData$EVTYPE <- gsub('CURRENTS', 'CURRENT', newData$EVTYPE)
newData$EVTYPE <- gsub('FLOODING', 'FLOOD', newData$EVTYPE)
newData$EVTYPE <- gsub('FLOOD/', '', newData$EVTYPE)
newData$EVTYPE <- gsub('/FOREST ', '', newData$EVTYPE)
newData$EVTYPE <- gsub('URBAN/SML STREAM FLD', 'FLOOD', newData$EVTYPE)
newData$EVTYPE <- gsub('/MIX', '', newData$EVTYPE)
newData$EVTYPE <- gsub('/WIND CHILL', '', newData$EVTYPE)
newData$EVTYPE <- gsub('DENSE FOG', 'FOG', newData$EVTYPE)
newData$EVTYPE <- gsub('HEAVY SNOW', 'SNOW', newData$EVTYPE)

To further simplify our dataset, we will split it into one set for population variables and one set for economy variables. We will only examine rows that have non-zero values in one of these entries.

popData <- filter(newData, (INJURIES > 0 | FATALITIES > 0))
econData <- filter(newData, (PROPDMG > 0 | CROPDMG > 0 ))

For the economic data, we need to convert the damage columns given into the actual dollar amount of damage. To do this, we first examine the exponent variables.

table(newData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 460832      1      8      3    214     25     13      4      3     25 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     17      0      6 422240      6  11025
table(newData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 611773      6     18      1      6     20 280756      0   1852

The PROPDMGEXP contains every exponent value except for ‘k’, so we can create a list of all the exponents as follows:

exp <- append(as.character(unique(newData$PROPDMGEXP)),'k')

Now we need to create a vector that matches these exponent codes to actual numbers. B, M, K, and H will be billion, million, thousand, and hundred. Any numeric value will be interpreted as an exponent (ie. 2 means 10^2). The meaning of +, -, and ? is ambiguous, so these will just be set to 0.

expNum <- c(1e3, 1e6, 0, 1e9, 0, 1, 1e5, 1e6, 1e6, 0,
            1e4, 1e2, 1e3, 1e7, 1e2, 0, 1e1, 1e8, 1e3)
expDf <- data.frame(exp)
expDf$expNum <- expNum
expDf
##    exp expNum
## 1    K  1e+03
## 2    M  1e+06
## 3       0e+00
## 4    B  1e+09
## 5    +  0e+00
## 6    0  1e+00
## 7    5  1e+05
## 8    m  1e+06
## 9    6  1e+06
## 10   ?  0e+00
## 11   4  1e+04
## 12   2  1e+02
## 13   3  1e+03
## 14   7  1e+07
## 15   H  1e+02
## 16   -  0e+00
## 17   1  1e+01
## 18   8  1e+08
## 19   k  1e+03

We are interested in the total amount of damage done. This will be the sum of the total crop damage and total property damage. To get each of these individual values, we need to multiply the value in *DMG by the correct exponent value from our expDf table. First we add a new column to econData that will store the correct numeric value for each exponent factor. Then we multiply and add the two damages to get the total damage.

econData$peNum <- apply(econData['PROPDMGEXP'],1,  function(x) expDf[x == expDf$exp, 2])
econData$ceNum <- apply(econData['CROPDMGEXP'],1, function(x) expDf[x==expDf$exp, 2])
econData <- mutate(econData, dmgTotal = PROPDMG*peNum + CROPDMG*ceNum)

Finally we calculate the sums of these damages, grouped by event type. We sort them So we can see the top values.

econSum <- summarise(group_by(econData, EVTYPE), Total = sum(dmgTotal))
econSum <- arrange(econSum, desc(Total))

For fatalities and injuries, no conversion is needed. We can simply sum the injuries and fatalities, and group by the event type.

fatSum <- summarise(group_by(popData, EVTYPE), Fatalities = sum(FATALITIES))
fatSum <- arrange(fatSum, desc(Fatalities))
injSum <- summarise(group_by(popData, EVTYPE), Injuries = sum(INJURIES))
injSum <- arrange(injSum, desc(Injuries))

Results

First let’s examine the results for the total economic damage.

hist(econSum[[2]], xlab = 'Total Damage ($)', col='green',
     main = 'Distribution of Total Damage by Event Type')

Here we have a histogram showing the distribution of the total economic damage for each of the event types. It is clear from the histogram that the majority of the event types have a fairly small effect, and that a very small number have a significantly larger effect. To see what they are, we look at the top 5 categories.

econSum[1:5,]
## Source: local data frame [5 x 2]
## 
##        EVTYPE        Total
##         (chr)        (dbl)
## 1       FLOOD 150386476000
## 2     TORNADO  57362333884
## 3 STORM SURGE  43323541000
## 4 FLASH FLOOD  18835965528
## 5        HAIL  18761221926

Flood events have triple the amount of damage of the second most damaging event, so clearly they have the most significant economic impact.

Next we examine the results for population impact.

par(mfrow=c(1,2))
hist(injSum[[2]], xlab = 'Total Injuries', col = 'red',
     main = 'Distribution of Injuries by Event Type')
hist(fatSum[[2]], xlab = 'Total Fatalities', col='blue',
     main= 'Distribution of Fatalities by Event Type')

Similar to the economic impact, we see that a vast majority of events have a relatively smaller number of injuries and fatalities, with a few events having a significantly higher impact. We again examine the top 5 in each category.

injSum[1:5,]
## Source: local data frame [5 x 2]
## 
##              EVTYPE Injuries
##               (chr)    (dbl)
## 1           TORNADO    91346
## 2 THUNDERSTORM WIND     9353
## 3             FLOOD     6868
## 4    EXCESSIVE HEAT     6525
## 5         LIGHTNING     5230
fatSum[1:5,]
## Source: local data frame [5 x 2]
## 
##           EVTYPE Fatalities
##            (chr)      (dbl)
## 1        TORNADO       5633
## 2 EXCESSIVE HEAT       1903
## 3    FLASH FLOOD       1014
## 4           HEAT        937
## 5      LIGHTNING        816

Tornadoes have nearly 10 times the number of injuries, and almost three times the number of fatalities compared to the second highest event type. With such a wide margin in both injuries and fatalities, it is clear that tornadoes have the highest impact on the health of the population.