Synopsis

In this report, I’ve explored the NOAA Storm Dataset in an attempt to answer the following basic questions about severe weather events:

  1. Across the US, which types of events (as recorded in the EVTYPE column) are most harmful with respect to human health?
  2. Across the US, which types of events have the greatest economic consequences?

I’ve performed the analysis on only the relevant columns, tracking two datasets side by side: the overall dataset (1950-2011) and a more consistently recorded “sound” subset (1993-2011), restricted to the years with at least half of the peak observation count. The results seem to indicate that while tornadoes cause the most harm to human health, floods cause the most economic damage.

Data Processing

First, load required R packages:

library(ggplot2)
library(dplyr)
library(gridExtra)
if (!require(car)) {
    install.packages("car")
    require(car)
}

Next, from the NOAA dataset I read in only the variables that are relevant to this analysis. These columns are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP:

# Load relevant data columns only ('NULL' in colClasses skips a column)
if (!exists("storm.data")) {
    storm.data <- read.csv("repdata-data-StormData.csv.bz2",
                           header=TRUE,
                           na.strings="",
                           colClasses=c('NULL', 'character', rep('NULL',5), 'factor', rep('NULL',14),
                                        'numeric','numeric','numeric','factor','numeric','factor',
                                        rep('NULL',9)))

    # Reduce BGN_DATE to the year only (to see the observation distribution by year, later on)
    storm.data$BGN_DATE <- as.numeric(format(as.POSIXct(storm.data$BGN_DATE, format="%m/%d/%Y", tz=""), "%Y"))
}
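To illustrate how colClasses drops unwanted columns and how the year is pulled out of the date string, here is a minimal sketch on a made-up two-row CSV. The column layout, the values, and the names toy_csv and toy are assumptions for illustration only, as is the date format “4/18/1950 0:00:00”. It uses as.Date rather than as.POSIXct simply to sidestep time zones.

```r
# Toy CSV standing in for the real file (layout and values are made up)
toy_csv <- "STATE,BGN_DATE,EVTYPE,FATALITIES
1,4/18/1950 0:00:00,TORNADO,0
1,6/1/1993 0:00:00,HAIL,2"

# 'NULL' skips the first column; the remaining three are read with the given classes
toy <- read.csv(text = toy_csv, header = TRUE,
                colClasses = c("NULL", "character", "character", "numeric"))

# Parse the month/day/year part of the string and keep only the year as a number
toy$YEAR <- as.numeric(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

print(toy)
```

Here the STATE column never reaches the data frame, which is what keeps the real load small.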

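Once loaded, the data needs to be cleaned. Specifically, event types recorded as “?” are relabeled “NOT DEFINED”, leading and trailing whitespace is trimmed from the event types, and the exponent codes in PROPDMGEXP and CROPDMGEXP are recoded to numeric powers of ten.

Note: The Storm Data Documentation specifies the exponent codes as k = thousands, m = millions, b = billions (the code below additionally reads h as hundreds). However, it is unclear from the handbook how other characters found in these columns - such as “?”, “-”, “+” - are to be treated. In light of this uncertainty, I assign these (and missing values) an exponent of 0, i.e. a multiplier of 1.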

        # Change "?" EVTYPE to "NOT DEFINED"
        storm.data$EVTYPE <- gsub("^[?]$", "NOT DEFINED", storm.data$EVTYPE)
        
        # Trim leading whitespaces from EVTYPE
        storm.data$EVTYPE <- gsub("^\\s+", "", storm.data$EVTYPE)
        
        # Trim trailing whitespace
        storm.data$EVTYPE <- gsub("\\s+$", "", storm.data$EVTYPE)
        
        # Recode exponent columns PROPDMGEXP and CROPDMGEXP to numeric exponents (using recode() from "car" package).
        # car::recode is called explicitly because dplyr also exports a recode(); the result is
        # coerced to numeric by default, so no extra argument is needed.
        storm.data$PROPDMGEXP <- car::recode(tolower(storm.data$PROPDMGEXP), 
                                         "'h'=2; 'k'=3;'m'=6;'b'=9;
                                         c(NA,'?','-','+')=0")
        storm.data$CROPDMGEXP <- car::recode(tolower(storm.data$CROPDMGEXP), 
                                        "'h'=2; 'k'=3;'m'=6;'b'=9;
                                         c(NA,'?','-','+')=0")
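For readers without the car package, the exponent mapping described in the note can be sketched in base R. The helper exp_to_power below is hypothetical (it is not part of car or of this report's code) and encodes one assumption of mine: purely numeric codes such as “5” are kept as their numeric value, which is my understanding of what happens to unmatched values when the recoded result is coerced to numeric.

```r
# Hypothetical base-R stand-in for the recode step (illustration only)
exp_to_power <- function(code) {
  code <- tolower(code)
  lookup <- c(h = 2, k = 3, m = 6, b = 9)            # letter codes -> power of ten
  power <- unname(lookup[code])                       # NA for anything not in lookup
  digit <- suppressWarnings(as.numeric(code))         # numeric codes, NA otherwise
  is_digit <- is.na(power) & !is.na(digit)
  power[is_digit] <- digit[is_digit]                  # assumption: digits pass through
  power[is.na(power)] <- 0                            # "?", "-", "+", NA -> exponent 0
  power
}

codes <- c("K", "m", "B", "h", "?", "-", "+", NA, "5")
print(data.frame(code = codes, power = exp_to_power(codes)))
```

An exponent of 0 means the recorded damage figure is multiplied by 10^0 = 1, i.e. taken at face value.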

Here is a sample snapshot of the cleaned dataset:

storm.data[sample(nrow(storm.data), 3), ]
##        BGN_DATE            EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 342760     1998              HAIL          0        0       0          0
## 801619     2010         HIGH WIND          0        0       0          3
## 868695     2011 THUNDERSTORM WIND          0        0       6          3
##        CROPDMG CROPDMGEXP
## 342760       0          0
## 801619       0          3
## 868695       0          3

Data Analysis

It has been pointed out that the data from the earlier decades might not have been captured as consistently. It is therefore important to look at the distribution of observations over the years:

# Histogram of the number of observations per year in the NOAA dataset
obs.plot <- ggplot(storm.data, aes(x=BGN_DATE))
obs.plot +
        geom_histogram(aes(fill=..count..)) +
        labs(title="NOAA Storm Data Observations Distribution by Year",
             x="Year", y="Observations", fill=" Observation Count \n (Color Scale)") +
        theme_bw() +
        theme(axis.text.x=element_text(angle=45, hjust=1))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

The histogram suggests that observation counts peak at around 100,000 (in 2009) and that the years 1993 to 2011 each reach at least half of that peak, i.e. roughly 50,000 observations. Note that with the default of 30 bins over 1950-2011 each bar spans about two years, so these figures are approximate readings from the plot. For the sake of accuracy, it seems sound to track this more consistently recorded range alongside the full data as the analysis proceeds.

In line with this, I create a separate subset, “sound.data”, containing only the observations from 1993 to 2011:

# Create the "sound" subset of storm.data (1993 onward)
sound.data <- subset(storm.data, BGN_DATE>=1993)
rownames(sound.data) <- NULL
sound.data[sample(nrow(sound.data), 3), ]
##        BGN_DATE            EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 26610      1995              HAIL          0        0     1.5          3
## 605067     2010       FLASH FLOOD          0        0     0.0          3
## 646044     2010 THUNDERSTORM WIND          0        0     5.0          3
##        CROPDMG CROPDMGEXP
## 26610      0.3          3
## 605067     0.0          3
## 646044     0.0          3
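The 1993 cut-off above was read off the histogram. As a sketch of how such a cut-off could instead be derived from exact yearly counts, the snippet below applies the same “at least half of the peak” rule to a made-up vector of years. The data and the names years, counts, threshold and sound_years are all assumptions for illustration; they are not the report's figures.

```r
# Toy vector of observation years (made-up counts, not the NOAA data)
years <- c(rep(1990, 2), rep(1991, 3), rep(1992, 6), rep(1993, 10), rep(1994, 7))

# Exact number of observations per year
counts <- table(years)

# Keep the years whose count is at least half of the busiest year's count
threshold <- max(counts) / 2
sound_years <- as.numeric(names(counts)[as.vector(counts) >= threshold])

print(counts)
print(sound_years)
```

On the real data, the equivalent of table(storm.data$BGN_DATE) would give per-year counts without the binning of the histogram.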

Moving on, the “EVTYPE” column appears to contain near-duplicate labels, but it is unclear from the Storm Data Documentation whether “TSTM WIND” is the same as “THUNDERSTORM WINDS”, or whether “FLOOD” is equivalent to “FLASH FLOOD”. As a result, I have made no assumptions about such labels and have treated each event type as is.
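To see how many such near-duplicate labels exist without merging any of them, one could list the labels that share a keyword. The sketch below does this on a small made-up vector; the vector evtypes and the search patterns are assumptions for illustration, and nothing here changes the analysis.

```r
# Toy set of event labels (illustrative, not the full EVTYPE column)
evtypes <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS",
             "FLOOD", "FLASH FLOOD", "HAIL")

# List labels sharing a keyword; this only inspects, it does not recode anything
wind_like  <- grep("TSTM|THUNDERSTORM", unique(evtypes), value = TRUE)
flood_like <- grep("FLOOD", unique(evtypes), value = TRUE)

print(wind_like)
print(flood_like)
```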

Extract Most Harmful Weather Events to Human Health

For the purposes of this analysis, the measure of damage to human health may be given by the sum of the fatalities and injuries caused by a type of weather event. What follows is the aggregation of the total fatalities and injuries (health casualties) by the various event types:

# For analysis of #1, create a new column that sums up the effects on health (defined
# as FATALITIES+INJURIES) per observation
storm.data$HEALTH.CASUALTIES <- storm.data$FATALITIES + storm.data$INJURIES
sound.data$HEALTH.CASUALTIES <- sound.data$FATALITIES + sound.data$INJURIES

# Now, aggregate sum of HEALTH.CASUALTIES by EVTYPE
harmful.events <- aggregate(HEALTH.CASUALTIES~EVTYPE, data=storm.data, FUN=sum)
harmful.events.sound <- aggregate(HEALTH.CASUALTIES~EVTYPE, data=sound.data, FUN=sum)
# arrange and get top 10 harmful events
harmful.events <- head(arrange(harmful.events, desc(HEALTH.CASUALTIES)), 10)
harmful.events.sound <- head(arrange(harmful.events.sound, desc(HEALTH.CASUALTIES)), 10)
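The aggregation step can be checked by hand on a tiny example. The sketch below uses a made-up four-row data frame (the name toy and its values are assumptions) and base R's order() in place of dplyr's arrange(), so that it runs without extra packages.

```r
# Made-up observations (not from the NOAA data)
toy <- data.frame(EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  FATALITIES = c(1, 0, 2, 0),
                  INJURIES   = c(10, 3, 5, 0))

# Health casualties per observation, then summed per event type
toy$HEALTH.CASUALTIES <- toy$FATALITIES + toy$INJURIES
totals <- aggregate(HEALTH.CASUALTIES ~ EVTYPE, data = toy, FUN = sum)

# Sort from most to least harmful and keep the top 2
totals <- totals[order(-totals$HEALTH.CASUALTIES), ]
print(head(totals, 2))
```

The two tornado rows contribute 11 and 7 casualties, so TORNADO totals 18 and ranks ahead of FLOOD with 3.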

We see that the top ten harmful (to human health) events across the USA for 1950 to 2011 are:

harmful.events
##               EVTYPE HEALTH.CASUALTIES
## 1            TORNADO             96979
## 2     EXCESSIVE HEAT              8428
## 3          TSTM WIND              7461
## 4              FLOOD              7259
## 5          LIGHTNING              6046
## 6               HEAT              3037
## 7        FLASH FLOOD              2755
## 8          ICE STORM              2064
## 9  THUNDERSTORM WIND              1621
## 10      WINTER STORM              1527

Whereas the top ten harmful events in the more complete dataset, from 1993 onward, are:

harmful.events.sound
##               EVTYPE HEALTH.CASUALTIES
## 1            TORNADO             24931
## 2     EXCESSIVE HEAT              8428
## 3              FLOOD              7259
## 4          LIGHTNING              6046
## 5          TSTM WIND              3872
## 6               HEAT              3037
## 7        FLASH FLOOD              2755
## 8          ICE STORM              2064
## 9  THUNDERSTORM WIND              1621
## 10      WINTER STORM              1527
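Subtracting the second table from the first shows where the two rankings diverge. The sketch below copies the top five totals from the two tables above into named vectors (the names overall, sound and pre1993 are mine) and computes the casualties that fall in 1950-1992.

```r
# Totals copied from the two tables above
overall <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
             "FLOOD" = 7259, "LIGHTNING" = 6046)
sound   <- c("TORNADO" = 24931, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 3872,
             "FLOOD" = 7259, "LIGHTNING" = 6046)

# Casualties present in the full data but not in the 1993-2011 subset
pre1993 <- overall - sound[names(overall)]
print(pre1993)

# Share of tornado casualties that predate 1993, in percent
tornado_share <- round(100 * pre1993[["TORNADO"]] / overall[["TORNADO"]])
print(tornado_share)
```

Only TORNADO and TSTM WIND differ between the two tables; the other three totals are identical, which is why only those two event types change position.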

Extract Most Economically Damaging Weather Events

Similar in principle to the above, a measure of economic damage may be defined (for the purposes of this analysis) by the sum of property and crop damage caused by a type of weather event:

# For analysis of #2, aggregate economic damage by Event type
# economic damage = property damage + crop damage
storm.data$econ.damage <- with(storm.data, PROPDMG*(10^PROPDMGEXP) + CROPDMG*(10^CROPDMGEXP))
sound.data$econ.damage <- with(sound.data, PROPDMG*(10^PROPDMGEXP) + CROPDMG*(10^CROPDMGEXP))

# Now, get top 10 events to cause most damage
Econ.Damage <- aggregate(econ.damage~EVTYPE, data=storm.data, FUN=sum)
Econ.Damage <- head(arrange(Econ.Damage, desc(econ.damage)), 10)

Econ.Damage.sound <- aggregate(econ.damage~EVTYPE, data=sound.data, FUN=sum)
Econ.Damage.sound <- head(arrange(Econ.Damage.sound, desc(econ.damage)), 10)
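As a worked example of the damage formula, take the 1995 HAIL row from the sound.data sample shown earlier (PROPDMG 1.5 with exponent 3, CROPDMG 0.3 with exponent 3): its economic damage is 1.5 x 10^3 + 0.3 x 10^3 = 1,800 dollars. The sketch below reproduces this and adds a second, made-up row with exponent 0 to show that such values are taken at face value; the data frame name rows is an assumption.

```r
# Row 1: values from the sound.data sample; row 2: made-up, exponent 0
rows <- data.frame(PROPDMG = c(1.5, 5), PROPDMGEXP = c(3, 0),
                   CROPDMG = c(0.3, 0), CROPDMGEXP = c(3, 0))

# Same formula as in the analysis: magnitude times ten to the recoded exponent
rows$econ.damage <- with(rows, PROPDMG*(10^PROPDMGEXP) + CROPDMG*(10^CROPDMGEXP))
print(rows)
```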

We see that the top ten most economically damaging events across the USA for 1950 to 2011 are:

Econ.Damage
##               EVTYPE  econ.damage
## 1              FLOOD 150319678257
## 2  HURRICANE/TYPHOON  71913712800
## 3            TORNADO  57362333947
## 4        STORM SURGE  43323541000
## 5               HAIL  18761221986
## 6        FLASH FLOOD  18244041079
## 7            DROUGHT  15018672000
## 8          HURRICANE  14610229010
## 9        RIVER FLOOD  10148404500
## 10         ICE STORM   8967041360

Whereas the top ten most economically damaging events in the more complete dataset, from 1993 onward, are:

Econ.Damage.sound
##               EVTYPE  econ.damage
## 1              FLOOD 150319678257
## 2  HURRICANE/TYPHOON  71913712800
## 3        STORM SURGE  43323541000
## 4            TORNADO  26764135377
## 5               HAIL  18761221986
## 6        FLASH FLOOD  18244041079
## 7            DROUGHT  15018672000
## 8          HURRICANE  14610229010
## 9        RIVER FLOOD  10148404500
## 10         ICE STORM   8967041360

The final datasets are now set up. On to the results.


Results

Most Harmful Weather Events

The top 10 weather-related events that have caused the most harm to human health, as defined by the sum of the fatalities and injuries caused, may be seen here:

# ggplot harmful events (labels are passed to labs() as named arguments)
harmful.events.plot <- ggplot(data=harmful.events, 
                              aes(x=reorder(EVTYPE, -HEALTH.CASUALTIES), 
                                  y=HEALTH.CASUALTIES)) + 
        geom_bar(stat="identity", fill="red", col="black") + 
        theme_bw() +
        labs(title="Top Ten Harmful Weather Events \n Across the US (1950-2011)",
             x="Type of Event",
             y="Total Health Damage (Fatalities + Injuries)") +
        theme(legend.position="none")+
        theme(axis.text.x=element_text(angle=35, hjust=1))

# sound data
harmful.events.sound.plot <- ggplot(data=harmful.events.sound, 
                              aes(x=reorder(EVTYPE, -HEALTH.CASUALTIES), 
                                  y=HEALTH.CASUALTIES)) + 
        geom_bar(stat="identity", fill="orange", col="black") + 
        theme_bw() +
        labs(title="Top Ten Harmful Weather Events \n Across the US (1993-2011)",
             x="Type of Event",
             y="") +
        theme(legend.position="none")+
        theme(axis.text.x=element_text(angle=35, hjust=1))

grid.arrange(harmful.events.plot, harmful.events.sound.plot, ncol=2)

We consider the top 5 results in terms of the overall dataset and sound data subset, in decreasing order of threat to human health:

Rank  Overall Data (1950-2011)  Sound Data (1993-2011)
1     Tornado                   Tornado
2     Excessive Heat            Excessive Heat
3     Tstm Wind                 Flood
4     Flood                     Lightning
5     Lightning                 Tstm Wind

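The two rankings agree on the top two threats, tornadoes and excessive heat. Restricting the analysis to the more completely recorded years moves thunderstorm wind (“Tstm Wind”) from third to fifth place, behind floods and lightning, which suggests that the subset gives a more representative top 5 of threats to human health.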

Most Economically Damaging Weather Events

The top ten weather related events that have caused the most economic damage - sum of property damage and crop damage - across the USA may be seen here, color coded by cost of damage:

# ggplot economic damage
# y is mapped to the numeric damage (comma-formatted via the axis labels) so that
# bar heights stay proportional to the amounts
econ.damage.plot <- ggplot(data=Econ.Damage, 
                              aes(x=reorder(EVTYPE, -econ.damage), 
                                  y=econ.damage,
                                  fill=-econ.damage))+
        geom_bar(stat="identity") + 
        scale_y_continuous(labels=scales::comma) +
        theme_bw() +
        labs(title="Damaging Events \n Across US (1950-2011)",
             x="Type of Event",
             y="Total Economic Damage ($) \n (Sum of Property & Crop Damage)") +
        theme(legend.position="none")+
        theme(axis.text.x=element_text(angle=45, hjust=1))

# sound data
econ.damage.sound.plot <- ggplot(data=Econ.Damage.sound, 
                              aes(x=reorder(EVTYPE, -econ.damage), 
                                  y=econ.damage,
                                  fill=-econ.damage))+
        geom_bar(stat="identity") + 
        scale_y_continuous(labels=scales::comma) +
        theme_bw() +
        labs(title="Damaging Events \n Across US (1993-2011)",
             x="Type of Event",
             y="") +
        theme(legend.position="none")+
        theme(axis.text.x=element_text(angle=45, hjust=1))

grid.arrange(econ.damage.plot, econ.damage.sound.plot, ncol=2)

Here, we consider the top 5 results in terms of the overall dataset and sound data subset, in decreasing order of economic damage:

Rank  Overall Data (1950-2011)  Sound Data (1993-2011)
1     Flood                     Flood
2     Hurricane/Typhoon         Hurricane/Typhoon
3     Tornado                   Storm Surge
4     Storm Surge               Tornado
5     Hail                      Hail

Here the two rankings differ only in third and fourth place: in the more completely recorded years, storm surge overtakes tornadoes. Floods and hurricanes/typhoons remain the two most devastating event types in terms of property and crop damage across the USA in both datasets, with hail in fifth place.

End of Report