This publication is part of the Coursera Reproducible Research class taught by Dr. Roger Peng as part of the Data Science Specialization.

Executive Summary

Using data collected and documented by the National Oceanic & Atmospheric Administration, we want to answer these two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

In their raw format, the data vary widely in how specific weather events are labeled, as a result of abbreviations, typos, etc. The first part of the processing code simplifies and combines the Event Type classification variable. Plots are then constructed to assess the top ten extreme weather event types by Fatalities (to gauge the effect on population health) and by combined Property and Crop Damage (as a proxy for economic damage).

Data Processing

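if (!file.exists("repdata-data-StormData.csv.bz2"))  {
     link<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" #From the Course Website
     #mode="wb" keeps the compressed file intact on Windows
     download.file(link,"repdata-data-StormData.csv.bz2",mode="wb")
     }
#stringsAsFactors=TRUE reproduces the Factor columns shown below (the default changed to FALSE in R 4.0)
if (!exists("stormDF")) stormDF <- read.csv("repdata-data-StormData.csv.bz2",stringsAsFactors=TRUE)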
#Let's examine the data
str(stormDF)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(table(stormDF$EVTYPE),10)
## 
##    HIGH SURF ADVISORY         COASTAL FLOOD           FLASH FLOOD 
##                     1                     1                     1 
##             LIGHTNING             TSTM WIND       TSTM WIND (G45) 
##                     1                     4                     1 
##            WATERSPOUT                  WIND                     ? 
##                     1                     1                     1 
##       ABNORMAL WARMTH 
##                     4

Cleaning the Data

In this raw format, the event type labels are inconsistent. For example, we have events listed under WINTER STORM HIGH WINDS, WINTER STORM/HIGH WINDS, and WINTER STORM/HIGH WIND, as well as THUNDERSTORM WINDS and TSTM WINDS. We need to consolidate the categories as best we can into the 48 official categories given by the National Weather Service.
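One way to surface such variants is to search the labels with a regular expression. The sketch below uses a small made-up vector in place of `levels(stormDF$EVTYPE)`; the variable names are illustrative only.

```r
# Toy labels standing in for levels(stormDF$EVTYPE); not the real data
labels <- c("WINTER STORM HIGH WINDS", "WINTER STORM/HIGH WINDS",
            "WINTER STORM/HIGH WIND", "THUNDERSTORM WINDS", "TSTM WINDS",
            "HAIL")

# grep(..., value = TRUE) returns the matching strings themselves
winter_variants  <- grep("WINTER STORM", labels, value = TRUE)
thunder_variants <- grep("THUNDERSTORM|TSTM", labels, value = TRUE)

print(winter_variants)
print(thunder_variants)
```

Running the same two `grep` calls against the real factor levels would list every spelling that needs to be merged.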

Cleaning Labels

First, labels will be cleaned by:

  • Capitalizing all labels
  • Removing excess whitespace
  • Removing multiple events indicated by slashes. If multiple events are listed and separated by slashes, just take the first one (official labels that themselves contain a slash, such as FROST/FREEZE, are kept whole).
  • Removing plurals
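The four cleaning steps above can be sketched on a few made-up labels. This is a simplified illustration, not the report's processing code: it uses `trimws` and `sub` for brevity and does not handle the official labels that contain a slash.

```r
# Hypothetical raw labels, chosen to exercise each cleaning step
raw <- c("  High Surf Advisory", "Winter Storm/High Winds", "tstm  winds")

clean <- toupper(raw)              # capitalize
clean <- trimws(clean)             # strip leading and trailing whitespace
clean <- gsub("\\s+", " ", clean)  # collapse internal runs of whitespace
clean <- sub("/.*$", "", clean)    # keep only the first of several events
clean <- sub("S$", "", clean)      # drop a trailing plural S
print(clean)
```

Each step is vectorized, so the same five lines would apply unchanged to the full column of labels.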

Combine similar but different events

Second, labels will be combined based on keyword or abbreviation. For example, “ice” and “icy” should both qualify as “ice storm,” and “thunder” or “tstm” should both qualify as thunderstorms (unless paired with “marine”).
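The keyword rule, including the “marine” exception, can be illustrated with vectorized `grepl` calls. The labels here are invented for the example and are assumed to be already capitalized and trimmed.

```r
# Hypothetical sample labels, already upper-cased and trimmed
labels <- c("TSTM WIND", "THUNDERSTORM WINDS", "MARINE TSTM WIND",
            "ICY ROADS", "ICE", "LIGHTNING")

consolidated <- labels
# ICE or ICY anywhere in the label maps to ICE STORM
consolidated[grepl("ICE|ICY", labels)] <- "ICE STORM"
# TSTM or THUN maps to THUNDERSTORM WIND, unless the label mentions MARINE
is_thunder <- grepl("TSTM|THUN", labels) & !grepl("MARINE", labels)
consolidated[is_thunder] <- "THUNDERSTORM WIND"

print(data.frame(original = labels, consolidated = consolidated))
```

Labels that match no keyword, such as LIGHTNING, pass through unchanged, as does the marine thunderstorm label.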

#Goal is to have 48 types (or as near as possible). Currently have 985
eventTypes <- stormDF$EVTYPE
length(levels(eventTypes))
## [1] 985
#Step 1: Capitalize all the Labels
eventTypes <- toupper(as.character(eventTypes))
#Step 2: Remove excess whitespace (trim the ends, then collapse any run of two or more spaces)
eventTypes <- gsub("^\\s+|\\s+$", "", eventTypes)
eventTypes <- gsub("\\s{2,}", " ", eventTypes)
#Step 3: Remove multiple event listings
getFirst<-function(label) {
     if (!grepl("\\/",label)) {return(label)}
     if (label == "EXTREME COLD/WIND CHILL" |
         label == "FROST/FREEZE" |
         label == "STORM SURGE/TIDE"
          ) {return(label)}
     label <- unlist(strsplit(label, split="/"))
     return(label[1])
}
eventTypes <- sapply(eventTypes,FUN=getFirst)
#Step 4: Remove Plurals
eventTypes <- gsub("[S]$","", eventTypes)
length(levels(as.factor(eventTypes)))
## [1] 694
#Combine similar labels by keyword (the first matching rule wins)
majLabels <- function(label) {
    if (grepl("TORNADO",label)) {return("TORNADO")}
    if (grepl("BLIZZARD",label)) {return("BLIZZARD")}
    if (grepl("HAIL",label))  {return("HAIL")}
    if (grepl("COLD|CHILL|COOL",label))  {return("EXTREME COLD/WIND CHILL")}
    if (grepl("HEAT|HOT|WARM",label))  {return("EXCESSIVE HEAT")}
    if (grepl("HIGH WIND",label))  {return("HIGH WIND")}
    if (grepl("HEAVY SNOW",label))  {return("HEAVY SNOW")}
    if (grepl("TROPICAL STORM",label))  {return("TROPICAL STORM")}
    if (grepl("SNOW",label))  {return("HEAVY SNOW")}
    if (grepl("HEAVY RAIN",label))  {return("HEAVY RAIN")}
    if (grepl("DRY|DROUGHT",label))  {return("DROUGHT")}
    if (grepl("ICE|ICY",label))  {return("ICE STORM")}
    if (grepl("HURRICANE|TYPHOON",label))  {return("HURRICANE (TYPHOON)")}
    
    if (!grepl("MARINE",label) & 
          grepl("TSTM|THUN",label)) {return("THUNDERSTORM WIND")}

    if (!grepl("FLASH|LAKESHORE",label) & 
          grepl("FLOOD",label)) {return("FLOOD")}
    
    label
}
eventTypes <- sapply(eventTypes,FUN=majLabels)
length(levels(as.factor(eventTypes)))
## [1] 313
#Reapply to dataset and filter 
stormDF$EVTYPE <- as.factor(eventTypes)
keepRow <- function(label) {
     masterList <- toupper(c("Astronomical Low Tide" , 
                         "Avalanche" , 
                         "Blizzard" , 
                         "Coastal Flood" , 
                         "Cold/Wind Chill" , 
                         "Debris Flow" , 
                         "Dense Fog" , 
                         "Dense Smoke" , 
                         "Drought" , 
                         "Dust Devil" , 
                         "Dust Storm" , 
                         "Excessive Heat" , 
                         "Extreme Cold/Wind Chill" , 
                         "Flash Flood" , 
                         "Flood" , 
                         "Frost/Freeze" , 
                         "Funnel Cloud" , 
                         "Freezing Fog" , 
                         "Hail" , 
                         "Heat" , 
                         "Heavy Rain" , 
                         "Heavy Snow" , 
                         "High Surf" , 
                         "High Wind" , 
                         "Hurricane (Typhoon)" , 
                         "Ice Storm" , 
                         "Lake-Effect Snow" , 
                         "Lakeshore Flood" , 
                         "Lightning" , 
                         "Marine Hail" , 
                         "Marine High Wind" , 
                         "Marine Strong Wind" , 
                         "Marine Thunderstorm Wind" , 
                         "Rip Current" , 
                         "Seiche" , 
                         "Sleet" , 
                         "Storm Surge/Tide" , 
                         "Strong Wind" , 
                         "Thunderstorm Wind" , 
                         "Tornado" , 
                         "Tropical Depression" , 
                         "Tropical Storm" , 
                         "Tsunami" , 
                         "Volcanic Ash" , 
                         "Waterspout" , 
                         "Wildfire" , 
                         "Winter Storm" , 
                         "Winter Weather"))
     if (label %in% masterList) {return(TRUE)}
     return(FALSE)
}
stormDF <- subset(stormDF, sapply(EVTYPE,FUN=keepRow),TRUE)
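As an alternative sketch, the same filter can be written without a per-row function call, because `%in%` is vectorized. The data frame and the three-entry list below are made-up stand-ins for `stormDF` and the 48 official categories.

```r
# Toy data frame standing in for stormDF (values are made up)
toyDF <- data.frame(
  EVTYPE     = factor(c("TORNADO", "SUMMARY OF MAY 22", "HAIL", "?")),
  FATALITIES = c(2, 0, 0, 0)
)
# Abbreviated, hypothetical stand-in for the 48-entry official list
officialTypes <- toupper(c("Tornado", "Hail", "Flood"))

# %in% compares every row at once, so no sapply() is needed
kept <- toyDF[toyDF$EVTYPE %in% officialTypes, ]
# droplevels() removes factor levels that no longer occur
kept$EVTYPE <- droplevels(kept$EVTYPE)
print(kept)
```

On roughly 900,000 rows, a vectorized comparison of this kind is typically much faster than calling a function once per row.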

Results

Data Analysis to answer First Question

Across the United States, which types of events are most harmful with respect to population health? To answer this, let’s identify the event types responsible for the most deaths throughout the entire time range of this dataset.

#Use tapply to sum FATALITIES by EVTYPE, sorted from most to fewest deaths
sums<-sort(tapply((stormDF$FATALITIES),stormDF$EVTYPE,sum),decreasing=TRUE)
#Keep the top ten event types and display the table
topEvents<-data.frame(Event_Type=factor(names(sums[1:10])),Deaths=as.numeric(sums[1:10]))
print(topEvents)
##                 Event_Type Deaths
## 1                  TORNADO   5658
## 2           EXCESSIVE HEAT   3159
## 3              FLASH FLOOD    994
## 4                LIGHTNING    816
## 5        THUNDERSTORM WIND    711
## 6              RIP CURRENT    577
## 7                    FLOOD    506
## 8  EXTREME COLD/WIND CHILL    467
## 9                HIGH WIND    297
## 10               AVALANCHE    224

The same table in graphical form:

#Take the table produced above and use ggplot2 to create a bar graph
library(ggplot2)
library(RColorBrewer) #provides brewer.pal
graph.health <- ggplot(data=topEvents)
#The legend is suppressed with guides(); show_guide is not an aesthetic and does not belong in aes()
graph.health <- graph.health + aes(x=reorder(Event_Type,Deaths),y=Deaths,fill=reorder(Event_Type,Deaths)) +
        ggtitle("Fig 1. Fatalities by Event Type") +
        xlab("Event Type") +
        geom_bar(stat="identity") + 
        theme(axis.text.x=element_text(angle=45, hjust=1)) +
        scale_fill_manual(values=c(brewer.pal(9,"Reds"),"#000000")) +
        guides(fill="none")
graph.health

The graph and the table both show that tornadoes have caused more harm to population health (measured in deaths) than any other form of severe weather.

Data Analysis to answer Second Question

Across the United States, which types of events have the greatest economic consequences? Let’s look at the aggregate property damage (PROPDMG) and crop damage (CROPDMG) attributed to each event type.

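To do this, we take the sum of all property damage plus all crop damage across all storms, grouped by event type, and then display a sorted table of the top 10 event types by damage. Note that PROPDMG and CROPDMG are summed as recorded, without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP, so the totals are in the dataset's raw units rather than whole dollars.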
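For reference, the PROPDMGEXP and CROPDMGEXP columns carry a magnitude code for each amount. A minimal sketch of scaling by that code follows, assuming K, M, and B mean thousands, millions, and billions, and (as a simplifying assumption) treating every other code as a multiplier of 1. The helper `expToMultiplier` and the toy rows are hypothetical and are not part of the analysis in this report.

```r
# Made-up rows in the shape of the storm data
toyDF <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "TORNADO", "HAIL"),
  PROPDMG    = c(1.5, 25, 2.5, 10),
  PROPDMGEXP = c("B", "K", "M", ""),
  stringsAsFactors = FALSE
)

# Assumed multipliers: K, M, B scale the amount; any other code counts as 1
expToMultiplier <- function(code) {
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

toyDF$propDollars <- toyDF$PROPDMG * expToMultiplier(toyDF$PROPDMGEXP)
totals <- sort(tapply(toyDF$propDollars, toyDF$EVTYPE, sum), decreasing = TRUE)
print(totals)
```

In this toy example a single flood recorded in billions outweighs both tornado rows, which shows how much the scaling can change a ranking.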

sums<-sort(tapply((stormDF$PROPDMG + stormDF$CROPDMG),stormDF$EVTYPE,sum),decreasing=TRUE)
topEvents<-data.frame(Event_Type=factor(names(sums[1:10]),ordered=TRUE),Damage=as.numeric(sums[1:10]))
print(topEvents)
##           Event_Type     Damage
## 1            TORNADO 3314617.78
## 2  THUNDERSTORM WIND 2873494.34
## 3        FLASH FLOOD 1603895.55
## 4               HAIL 1272128.79
## 5              FLOOD 1157771.83
## 6          LIGHTNING  606937.39
## 7          HIGH WIND  403286.98
## 8         HEAVY SNOW  154043.28
## 9       WINTER STORM  135699.58
## 10          WILDFIRE   89923.54

The same table in graphical form:

library(scales) #provides the dollar label formatter
graph.econ <- ggplot(data=topEvents)
#The legend is suppressed with guides(); show_guide is not an aesthetic and does not belong in aes()
graph.econ <- graph.econ + aes(x=reorder(Event_Type,Damage),y=Damage,fill=reorder(Event_Type,Damage)) +
        ggtitle("Fig 2. Economic Damage by Event Type") +
        xlab("Event Type") +
        geom_bar(stat="identity") + 
        theme(axis.text.x=element_text(angle=45, hjust=1)) +
        scale_y_continuous(labels=dollar) +
        scale_fill_manual(values=c(brewer.pal(9,"Greens"),"#000000")) +
        guides(fill="none")
graph.econ

The graph and the table both show that, by this measure, tornadoes have again caused more economic harm (damage to crops and property) than any other form of severe weather.