Synopsis

The Purpose of this document is to examine the most damaging storm types recorded in the US based on NOAA data downloaded from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 More information on this data-set can be found at: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf We will look at the most damaging storms in terms of health, (fatalities and injuries) and economic consequences. (crop and property damage in USD) This document will describe all data preparation and transformations, along with the results of the analysis. The dplyr, ggplot2, scales, and reshape2 libraries are used for the analysis.

Data Processing

First, we load the necessary libraries and data-set .csv file downloaded from the above link

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
library(reshape2)
SData <- read.csv("repdata-data-StormData.csv")

Next, we subset the data to make it more manageable, extracting only the columns we need for the analysis. These include the event type, number of fatalities and injuries, property and crop damage, and the columns that denote the magnitude of damage in thousands, millions, or billions. (PROPDMGEXP and CROPDMGEXP)

IData <- SData[c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]

Economic Damage Data

We need to process the data the extract the most useable and accurate results for both economic and health consequences. Since the economic data has more processing steps, we will start there.

Upon examining the data, there are a few anomalies we need to address.

str(IData)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

First, the variables that describe the damage magnitude are not case consistent and contain values which we can’t recognize based on the data-set description provided by the NOAA website.

levels(IData$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(IData$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Luckily, few of the 902297 observations in the data-set have unrecognized characters in the PROPDMGEXP and CROPDMGEXP fields, so they can be ignored without upsetting the aggregate analysis much. We can transform all the cases to upper for the values that are recognized (k, m, and b in the case of the magnitude variables) for consistency. To be on the safe side, we can convert cases for the Event Types to upper as well. Since R interprets the same character in different cases as being unique, this should alleviate any issues upper/lower case inconsistency may cause.

IData$PROPDMGEXP <- factor(toupper(IData$PROPDMGEXP))
IData$CROPDMGEXP <- factor(toupper(IData$CROPDMGEXP))
IData$EVTYPE <- toupper(IData$EVTYPE)

Next, we want to change the variable names of the base damage value to identify them as such and create numeric variables that multiply each base by its magnitude in dollars so we can compare “apples to apples” as far as damage value is concerned. Any unknown magnitude entries will be converted at 0 value.

IData <- rename(IData, PROPDMGBASE = PROPDMG, CROPDMGBASE = CROPDMG)
IData$PROPDMG <- ifelse(IData$PROPDMGEXP == "K", IData$PROPDMGBASE * 1000, 
                           ifelse(IData$PROPDMGEXP == "M",IData$PROPDMGBASE * 1000000, 
                                  ifelse(IData$PROPDMGEXP == "B",IData$PROPDMGBASE * 1000000000, 0)))

IData$CROPDMG <- ifelse(IData$CROPDMGEXP == "K", IData$CROPDMGBASE * 1000, 
                           ifelse(IData$CROPDMGEXP == "M",IData$CROPDMGBASE * 1000000, 
                                  ifelse(IData$CROPDMGEXP == "B",IData$CROPDMGBASE * 1000000000, 0)))

We can then create a total damage variable that we will use to rank the different storm types by the amount of total economic damage caused. While we are at it, we can also create a total injuries and fatalities variable for use in the health analysis.

IData$TOTDMG <- IData$PROPDMG + IData$CROPDMG
IData$TOTINJFAT <- IData$FATALITIES + IData$INJURIES

At this point, we have created all the variables we will be working with and cleaned the data. It is a good idea to take a look at what we have:

str(IData)
## 'data.frame':    902297 obs. of  11 variables:
##  $ EVTYPE     : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES   : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMGBASE: num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP : Factor w/ 17 levels "","-","?","+",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMGBASE: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP : Factor w/ 7 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PROPDMG    : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ CROPDMG    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTDMG     : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ TOTINJFAT  : num  15 0 2 2 2 6 1 0 15 0 ...

Our next step will be to group damage by storm type (factor level), summarize by the sum of Crop, Property, and Total Damage, and arrange in descending order by damage caused using the dplyr package so we can identify the top 10 most damaging storm types.

TDmg <- group_by(IData, EVTYPE)
DmgSummary <- summarize(TDmg, PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG), TOTALDMG = sum(TOTDMG))
DmgSummary <- arrange(DmgSummary, desc(TOTALDMG))

We can now preview the top 10:

head(DmgSummary, 10)
## Source: local data frame [10 x 4]
## 
##               EVTYPE      PROPDMG     CROPDMG     TOTALDMG
## 1              FLOOD 144657709800  5661968450 150319678250
## 2  HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3            TORNADO  56937160480   414953110  57352113590
## 4        STORM SURGE  43323536000        5000  43323541000
## 5               HAIL  15732266720  3025954450  18758221170
## 6        FLASH FLOOD  16140811510  1421317100  17562128610
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927810  5022113500   8967041310

At this point, we subset the data into the top 10 most economically damaging categories for ease of graphing. Note that we are sub setting both the summary data and the data-set with all of the individual points so we can see both the aggregate storm damage and the most individually damaging storms within each category.

Top10Type <- head(DmgSummary$EVTYPE, 10) #get the names of the top 10 from the EVTYPE variable
Top10Dmg <- filter(DmgSummary, EVTYPE %in% Top10Type) #use these names to filter the data
Top10Dat <- filter(IData, EVTYPE %in% Top10Type)

Health Damage Data

We follow a similar process to extract the injury and fatality information we want from the raw data. This data is easier to work with, and we have the added benefit of using a few objects we created for the economic damage data. Since we already grouped the initial data set we created by the event type, we can use this object (TDmg) to summarize the number of injuries and fatalities.

HlthSummary <- summarize(TDmg, FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), TOTINJFAT = sum(TOTINJFAT))

Now, we arrange this summary in descending order by the total and filter the top 10 most dangerous storm types. Again, we filter both the summary and the individual data points to see if there are any extreme individual cases.

HlthSummary <- arrange(HlthSummary, desc(TOTINJFAT))
Top10HType <- head(HlthSummary$EVTYPE, 10) #get the names of the top 10 from the EVTYPE variable
Top10HDmg <- filter(HlthSummary, EVTYPE %in% Top10HType) #use these names to filter the data
Top10HDat <- filter(IData, EVTYPE %in% Top10HType)

We can now preview the top 10 most dangerous storms to health:

head(HlthSummary, 10)
## Source: local data frame [10 x 4]
## 
##               EVTYPE FATALITIES INJURIES TOTINJFAT
## 1            TORNADO       5633    91346     96979
## 2     EXCESSIVE HEAT       1903     6525      8428
## 3          TSTM WIND        504     6957      7461
## 4              FLOOD        470     6789      7259
## 5          LIGHTNING        816     5230      6046
## 6               HEAT        937     2100      3037
## 7        FLASH FLOOD        978     1777      2755
## 8          ICE STORM         89     1975      2064
## 9  THUNDERSTORM WIND        133     1488      1621
## 10      WINTER STORM        206     1321      1527

Finally, we can use the reshape2 package to melt the data into a long format for graphing. This will convert our damage type and injury variables into factor levels so we can easily view them on a stacked bar chart.

Top10HDmg.m <- melt(Top10HDmg[,1:3], id.vars='EVTYPE', variable.name = "HLTHDMGTYPE", value.name = "DMGNUMBER")
Top10HDat.m <- melt(Top10HDat[,c(1:3)], id.vars='EVTYPE', variable.name = "HLTHDMGTYPE", value.name = "DMGNUMBER")
Top10Dmg.m <- melt(Top10Dmg[,1:3], id.vars='EVTYPE', variable.name = "DMGTYPE", value.name = "DMGVALUE")
Top10Dat.m <- melt(Top10Dat[,c(1,8,9)], id.vars='EVTYPE', variable.name = "DMGTYPE", value.name = "DMGVALUE")

Results

Now that we have adequately processed the data, we can view the graphical and tabular results. First we print out the top 10 list of storms that pose health risks and top 10 list that causes property damage. We can make sure these subsets agree with our earlier previews.

Top10HDmg
## Source: local data frame [10 x 4]
## 
##               EVTYPE FATALITIES INJURIES TOTINJFAT
## 1            TORNADO       5633    91346     96979
## 2     EXCESSIVE HEAT       1903     6525      8428
## 3          TSTM WIND        504     6957      7461
## 4              FLOOD        470     6789      7259
## 5          LIGHTNING        816     5230      6046
## 6               HEAT        937     2100      3037
## 7        FLASH FLOOD        978     1777      2755
## 8          ICE STORM         89     1975      2064
## 9  THUNDERSTORM WIND        133     1488      1621
## 10      WINTER STORM        206     1321      1527
Top10Dmg
## Source: local data frame [10 x 4]
## 
##               EVTYPE      PROPDMG     CROPDMG     TOTALDMG
## 1              FLOOD 144657709800  5661968450 150319678250
## 2  HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3            TORNADO  56937160480   414953110  57352113590
## 4        STORM SURGE  43323536000        5000  43323541000
## 5               HAIL  15732266720  3025954450  18758221170
## 6        FLASH FLOOD  16140811510  1421317100  17562128610
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927810  5022113500   8967041310

We can now graph our results using ggplot. This allows us to create a stacked bar chart that will break out the different levels of damage type and health risk, and superimpose a box-plot that will allow us to see outliers of extreme individual events on the same graph.

ggplot(Top10HDmg.m, aes(x = reorder(EVTYPE, DMGNUMBER), y = DMGNUMBER,fill=HLTHDMGTYPE)) +
    geom_bar(stat="identity") +
    geom_boxplot(data = Top10HDat.m, aes(x=EVTYPE, y=DMGNUMBER), show_guide = F) +
    scale_y_continuous(labels = comma) +
    coord_flip() + labs(y='Number of Individuals',x='Storm Type') +
    guides(fill=guide_legend(title="Severe Points Per Set:\nUpper Points = Injuries\nLower Points = Fatalities", 
    title.position = "bottom")) +
    scale_fill_manual(labels=c("Fatalities", "Injuries"),values = c("coral", "burlywood2")) +
    ggtitle("Top 10 Most Dangerous Types of Storms to Health \nPoints Represent Individual Storms of Severe Magnitude") +
    theme(plot.title = element_text(lineheight=.8, face="bold"))

ggplot(Top10Dmg.m, aes(x = reorder(EVTYPE, DMGVALUE), y = DMGVALUE,fill=DMGTYPE)) +
    geom_bar(stat="identity") +
    geom_boxplot(data = Top10Dat.m, aes(x=EVTYPE, y=DMGVALUE), show_guide = F) +
    scale_y_continuous(labels = comma) +
    coord_flip() + labs(y='Damage in USD',x='Storm Type') +
    guides(fill=guide_legend(title="Severe Points Per Set:\nUpper Points = Crop\nLower Points = Property", 
                             title.position = "bottom")) +
    scale_fill_manual(labels=c("Property Damage", "Crop Damage"),values = c("coral", "burlywood2")) +
    ggtitle("Top 10 Most Economically Damaging Types of Storms \nPoints Represent Individual Storms of Severe Magnitude") +
    theme(plot.title = element_text(lineheight=.8, face="bold"))

Based on these results, we can easily see the top 10 most dangerous storm types to health and the top 10 most economically damaging storm types. These can be compared in descending order by magnitude based to evalute which types pose the most risk and require the greatest precaution. We can see that while Tornados have historically caused the most injuries and fatalities by a large margin, floods have far-and-away outstripped all other storm types in terms of overall property damage. It’s also interesting to note that while single-event severe magnitude occurances seem to be bunched fairly close together and lower on the scale of heath risk, there are signifant outliers in events that cause property damage that appear to make up a signifigant portion of the total damage. This would indicate that a few events in the Flood, Typhoon, Storm Surge, River Flood, and Ice Storm categories caused a large portion of the overall damage, and should be studied further.