Synopsis/Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Research Questions

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Across the United States, which types of events have the greatest economic consequences?

Data Processing

Do all the preparation

Load the data.

datadir <- "~/Coursera/Reproducible Research/Project/Project2"
setwd(datadir)
# The file is bzip2-compressed, so open it with bzfile() rather than gzfile()
noaaData <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

There are 902297 observations in this data.

Load the required libraries.

library(gdata)
library(plyr)

Clean the Data

The EVTYPE, PROPDMGEXP and CROPDMGEXP values are inconsistent in their use of case, so I am going to convert them all to uppercase.

Many EVTYPE values also refer to the same event type but differ in spelling or carry extra numbers or characters, so I am going to collapse those into a single value too.

# Convert EVTYPE to all uppercase: goes from 985 to 898 distinct EVTYPE values
noaaData$EVTYPE <- toupper(noaaData$EVTYPE)
# Convert the damage codes to uppercase too, for use when calculating the economic cost
noaaData$PROPDMGEXP <- toupper(noaaData$PROPDMGEXP)
noaaData$CROPDMGEXP  <- toupper(noaaData$CROPDMGEXP)
# Trim leading and trailing whitespace (trim() comes from gdata):
# goes from 898 to 890 distinct EVTYPE values
noaaData$EVTYPE <- trim(noaaData$EVTYPE)
# Try to correct obvious spelling/typing/classification mistakes:
# goes from 890 to 408 distinct EVTYPE values
evtypelist <-as.character(noaaData$EVTYPE)
evtypelist <- gsub(".*THUNDERSTORM.*", "THUNDERSTORM", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*TSTM.*",         "THUNDERSTORM", evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*TORNADO.*",      "TORNADO",      evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*HURRICANE.*",    "HURRICANE",    evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*HEAT.*",         "HEAT",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*WARM.*",         "WARM",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SNOW.*",         "SNOW",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*COLD.*",         "COLD",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*COOL.*",         "COLD",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*RAIN.*",         "RAIN",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SURF.*",         "SURF",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*FLOOD.*",        "FLOOD",        evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*RISING WATER.*", "FLOOD",        evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*AVALAN.*",       "AVALANCHE",    evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*MUDSLIDE.*",     "MUDSLIDE",     evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*LANDSLIDE.*",    "LANDSLIDE",    evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*FIRE.*",         "FIRE",         evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SEAS.*",         "SEA",          evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*SWELLS.*",       "SEA",          evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*WAVE.*",         "SEA",          evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*BEACH.*",        "BEACH",        evtypelist, ignore.case=TRUE)
evtypelist <- gsub(".*BITTER.*",       "COLD",         evtypelist, ignore.case=TRUE)
noaaData$EVTYPE <- evtypelist

There are still 902297 observations in the data
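The same kind of normalisation can also be driven by a small lookup table instead of one gsub() call per pattern. The following is only a sketch on toy values: the helper name `normalise_evtype` and the `rules` vector are hypothetical, the rules are a small subset of the ones used above, and it matches fixed substrings with base R rather than regular expressions.

```r
# Toy EVTYPE values; the real column has several hundred distinct spellings.
raw <- c(" tstm wind", "THUNDERSTORM WINDS", "Flash Flood", "TORNADO F1",
         "EXTREME COLD", "HAIL")

# Substring -> label table (hypothetical subset of the rules in the report).
# Rules are applied in order, top to bottom.
rules <- c("THUNDERSTORM" = "THUNDERSTORM",
           "TSTM"         = "THUNDERSTORM",
           "TORNADO"      = "TORNADO",
           "COLD"         = "COLD",
           "FLOOD"        = "FLOOD")

normalise_evtype <- function(x, rules) {
    # Uppercase and trim first so that matching is case-insensitive
    x <- trimws(toupper(x))
    for (pattern in names(rules)) {
        hit <- grepl(pattern, x, fixed = TRUE)  # plain substring match
        x[hit] <- rules[[pattern]]              # replace the whole value
    }
    x
}

cleaned <- normalise_evtype(raw, rules)
print(cleaned)
print(c(before = length(unique(raw)), after = length(unique(cleaned))))
```

Values that match no rule (HAIL here) pass through unchanged apart from the uppercase and trim step, so the number of rows is never affected, only the number of distinct labels.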

Let's add a new field containing the total economic cost of each event. This is calculated by adding together the property damage and the crop damage, each scaled by its exponent code.

# Convert a damage amount and its exponent code into dollars.
# ifelse() is vectorised, so every row gets its own multiplier
# (a plain if() would only look at a single condition for the whole column).
# "H" is taken to mean hundreds; any other code leaves the amount unscaled.

damage <- function(a, exp) {
    multiplier <- ifelse(exp == "B", 1e9,
                  ifelse(exp == "M", 1e6,
                  ifelse(exp == "K", 1e3,
                  ifelse(exp == "H", 1e2, 1))))
    a * multiplier
}

# Economic damage is caused by property and crop damage added together

noaaData$total_cost<-(damage(noaaData$PROPDMG, noaaData$PROPDMGEXP) + damage(noaaData$CROPDMG, noaaData$CROPDMGEXP))
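As a worked illustration of the cost calculation, the sketch below applies a per-row multiplier to a four-row toy data frame. The names `multipliers` and `to_dollars` are hypothetical, reading "H" as hundreds is an assumption, and so is treating blank or unrecognised codes as a multiplier of 1.

```r
# Multipliers for the exponent codes; "H" as hundreds is an assumption.
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

to_dollars <- function(amount, exp_code) {
    # Look up each row's multiplier by name; unknown or blank codes give NA
    m <- multipliers[toupper(as.character(exp_code))]
    m[is.na(m)] <- 1  # assumption: leave the amount unscaled
    amount * unname(m)
}

toy <- data.frame(PROPDMG    = c(25, 2.5, 0, 10),
                  PROPDMGEXP = c("K", "M", "", "B"),
                  CROPDMG    = c(0, 1, 5, 0),
                  CROPDMGEXP = c("", "k", "K", ""))

toy$total_cost <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP) +
                  to_dollars(toy$CROPDMG, toy$CROPDMGEXP)
print(toy)
```

The first row works out to 25 * 1000 + 0 = 25000 dollars and the second to 2.5 * 1000000 + 1 * 1000 = 2501000 dollars, which shows that each row is scaled by its own code rather than by a single multiplier for the whole column.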

Limit the data to events with at least one injury or fatality.

healthimpactData <- noaaData[noaaData$FATALITIES > 0 | noaaData$INJURIES > 0,]
healthimpactData$totImpact <- healthimpactData$FATALITIES + healthimpactData$INJURIES

Now there are 21929 observations.

Let's summarise that data by event type and sort it by total impact.

healthimpind<-ddply(healthimpactData, c("EVTYPE"), summarise, N=length(totImpact),
                     total_impact=sum(totImpact),
                     average_impact=mean(totImpact), 
                     median_impact=median(totImpact)
                     )

healthimpind<-healthimpind[order(-healthimpind$total_impact),]

Now there are 88 observations, one per event type. Subset the health impact index to contain only the top 10.

t10_health_impact<-healthimpind[1:10,]
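The summarise, sort and subset steps above can be illustrated on a toy data frame. This sketch swaps in base R (split() and vapply()) for plyr's ddply(), so it is an alternative formulation rather than the code used in this report, and the toy values are invented purely for illustration.

```r
toy <- data.frame(EVTYPE    = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HEAT"),
                  totImpact = c(10, 2, 30, 5, 7, 1),
                  stringsAsFactors = FALSE)

# Group the impact values by event type
by_type <- split(toy$totImpact, toy$EVTYPE)

# One row per event type with the same four statistics as the report
summary_tbl <- data.frame(EVTYPE         = names(by_type),
                          N              = vapply(by_type, length, integer(1)),
                          total_impact   = vapply(by_type, sum, numeric(1)),
                          average_impact = vapply(by_type, mean, numeric(1)),
                          median_impact  = vapply(by_type, median, numeric(1)),
                          row.names = NULL, stringsAsFactors = FALSE)

# Sort by total impact, largest first, and keep the top entries
summary_tbl <- summary_tbl[order(-summary_tbl$total_impact), ]
top2 <- head(summary_tbl, 2)
print(top2)
```

Using head() instead of indexing with 1:10 avoids rows of NA when there are fewer groups than requested.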

Going back to the economic cost, let's limit the data to events with a positive economic cost.

costimpactData <- noaaData[noaaData$total_cost >0 ,]

Now there are 245031 observations.

Let's summarise that data by event type and sort it by total cost.

costimpind<-ddply(costimpactData, c("EVTYPE"), summarise, N=length(total_cost),
                     total_impact=sum(total_cost),
                     average_impact=mean(total_cost), 
                     median_impact=median(total_cost)
                     )

costimpind<-arrange(costimpind,desc(total_impact))

Now there are 160 observations, one per event type. Subset the cost impact index to contain only the top 10.

t10_cost_impact<-costimpind[1:10,]

Results

The first question was: across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The top 10 event types by combined fatalities and injuries are:

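par(cex.main=1.25, cex=0.75)
# The default palette has only 8 colours, so colours 9 and 10 would repeat
# in the legend; rainbow(10) gives each of the 10 bars its own colour
barplot(t10_health_impact$total_impact, 
        main="Total population health harm per event type", 
        ylab= "Persons harmed",
        xlab="Event type",
        col = rainbow(10),
        legend.text = t10_health_impact$EVTYPE)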

[Figure: bar chart "Total population health harm per event type", showing persons harmed for the top 10 event types]

As the chart shows, Tornado is by far the event type that causes the most fatalities and injuries in the US.

The second question was: across the United States, which types of events have the greatest economic consequences?

The top 10 event types by total cost are:

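par(cex.main=1.25, cex=0.75)
# The default palette has only 8 colours, so colours 9 and 10 would repeat
# in the legend; rainbow(10) gives each of the 10 bars its own colour
barplot(t10_cost_impact$total_impact, 
        main="Total economic consequence per event type", 
        ylab= "Cost ($)",
        xlab="Event type",
        col = rainbow(10),
        legend.text = t10_cost_impact$EVTYPE)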

[Figure: bar chart "Total economic consequence per event type", showing cost in dollars for the top 10 event types]

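As the chart shows, Tornado is once again the event type with the highest economic consequences, although Thunderstorm and Flood are also quite high. This ranking is sensitive to how the PROPDMGEXP and CROPDMGEXP multipliers are applied to each record, so the cost calculation should be checked before relying on the ordering.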

Recommendations

I think that more time needs to be spent understanding the causes of these events and whether any mitigation could reduce both fatalities/injuries and economic costs. A cost-benefit analysis of the suggested mitigations could then show where investment should be focussed.

To get more accurate data for a better analysis, the earlier years, when the dataset was populated inconsistently, could also be removed. This would also account for the fact that building codes, safety procedures and farming practices are different in more modern times.
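Dropping the earlier years could look roughly like the sketch below. It assumes the event start date is held in a BGN_DATE column formatted like "4/18/1950 0:00:00" (this column is not used anywhere else in this report), and the cutoff year of 1996 is a hypothetical value chosen only for illustration.

```r
toy <- data.frame(BGN_DATE = c("4/18/1950 0:00:00", "6/1/1995 0:00:00",
                               "1/3/1996 0:00:00", "11/28/2011 0:00:00"),
                  EVTYPE   = c("TORNADO", "HAIL", "FLOOD", "SNOW"),
                  stringsAsFactors = FALSE)

cutoff_year <- 1996  # hypothetical cutoff, for illustration only

# Parse the month/day/year part (the trailing time is ignored) and
# extract the year as an integer
event_year <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Keep only events that started in or after the cutoff year
recent <- toy[event_year >= cutoff_year, ]
print(recent)
```

Applied to the full dataset, the same filter would run before the summaries, so that both the health and cost rankings are based on the same, more complete period.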

References

  1. Data File https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
  2. Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
  3. FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf