Synopsis

An analysis was performed on major weather event data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, covering the period from 1950 to November 2011, to find the weather event types that cause the most human-related and economic damage in the United States. Since the data collection process and data availability have changed significantly over the past 60 years, the data, particularly the event type data, contain a large number of inconsistencies and discrepancies. Pattern matching with regular expressions was used to clean up the data as far as possible, but many of the merging choices were judgment calls made by the author.

We defined the measure of human-related damage to be the sum of fatalities and injuries, and that of economic damage to be the sum of property and crop damages. The different event types are ranked according to these metrics and the top 10 categories are displayed in charts in the Results section.

The top weather event types in terms of human-related damage are Tornado, Excessive Heat, and Thunderstorm Wind, with a total of 116,063 deaths and injuries altogether, while the top categories for economic damage are Flood, Hurricane, and Tornado, with over $310 Billion in damage incurred.

Data Processing

Download the Storm Data if the compressed file is not in the working directory, and unzip it if the csv file has not been extracted yet. The bunzip2 function in the R.utils package is used to unzip the bz2 file. The resulting csv file is then loaded into a data frame named data.

# download the data from the source if it is not already present
if(!file.exists("repdata-data-StormData.csv.bz2")){
    message("Downloading data")
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                  destfile = "repdata-data-StormData.csv.bz2", method = "auto",
                  mode = "wb")
    }

# unzip the bz2 file if the csv has not been extracted yet
if(!file.exists("repdata-data-StormData.csv")){
    if(!"R.utils" %in% rownames(installed.packages())){
        install.packages("R.utils")
        }
    library(R.utils)
    
    # keep the original bz2 file next to the extracted csv
    bunzip2("repdata-data-StormData.csv.bz2", remove = FALSE)
    }

# Load the csv into data frame
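# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the cleaning steps below rely on
data <- read.csv("repdata-data-StormData.csv", header = TRUE, 
                 stringsAsFactors = TRUE)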
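
As an alternative sketch, base R can read a bzip2-compressed csv through a bzfile connection, which would make the explicit unzip step optional. The example below uses a small temporary file with made-up rows (an assumption for illustration, not the storm data) to show the round trip; the names toy, tmp, and toy_read are hypothetical.

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO"), FATALITIES = c(0, 1))

# Write the toy data to a bzip2-compressed csv in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv accepts a connection, so the file is decompressed while reading
toy_read <- read.csv(bzfile(tmp), header = TRUE, stringsAsFactors = TRUE)
str(toy_read)
```

Reading through a connection avoids keeping an uncompressed copy on disk, at the cost of decompressing the file on every load.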

There are a total of 902,297 observations of 37 variables. Since we are only interested in the events that pertain to population health and economic damage, most of the variables are not applicable here. Hence it makes sense to subset the data to just what we require. For the purpose of this project, we will use only the following variables: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

# subset the dataset to only the above variables
subset <- data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", 
                   "CROPDMG", "CROPDMGEXP")]

# summary of the subset of data
summary(subset)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9
# structure of the subset of data
str(subset)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

There is little ambiguity with the data for FATALITIES, INJURIES, PROPDMG, and CROPDMG variables as they contain numeric values. The factor variables are quite a bit more convoluted. Quick tables for PROPDMGEXP and CROPDMGEXP show that they contain various symbols, letters, and numbers:

# tabulate PROPDMGEXP
table(subset$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
# tabulate CROPDMGEXP
table(subset$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

By far the largest group for both variables is the blank. Since no exponent is available for these records, we will assume an exponent of zero for them, as well as for the symbols “?”, “+”, and “-”.

Note: an exponent of zero means the recorded damage figure is multiplied by 10^0 = 1 and taken at face value. Since we do not have the exponent data anyway, these data points are unlikely to influence our analysis, and setting the missing values to zero makes the later calculations easier to carry out.

In addition, the letters used here signify different monetary magnitudes: “B” for Billions, “K” for Thousands, “H” for Hundreds, and “M” for Millions (Section 2.7 of the Storm Data Documentation). We can therefore replace “B”, “K”, “H”, and “M” with the exponents 9, 3, 2, and 6 respectively to make calculations easier.
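
The mapping just described can also be sketched as a named lookup vector. This is a minimal illustration under the same assumptions as the text (blanks and symbols count as zero, digits are kept as they are); exp_lookup and code_to_exponent are hypothetical names and are not used in the analysis itself.

```r
# Named lookup: exponent code -> power of ten (letters per Section 2.7;
# the symbols are treated as 0, matching the assumption in the text)
exp_lookup <- c("B" = 9, "M" = 6, "K" = 3, "H" = 2,
                "?" = 0, "+" = 0, "-" = 0)

# Hypothetical helper, not part of the original analysis
code_to_exponent <- function(code) {
    code <- toupper(as.character(code))           # treat "k"/"m"/"h" like "K"/"M"/"H"
    out <- suppressWarnings(as.numeric(code))     # digit codes pass through unchanged
    is_known <- code %in% names(exp_lookup)
    out[is_known] <- exp_lookup[code[is_known]]   # letters and symbols via the lookup
    out[code == ""] <- 0                          # blank exponent
    out
}

code_to_exponent(c("K", "m", "", "5", "?", "B"))
```

Keeping the mapping in one vector makes the assumptions easy to audit and avoids repeating the same replacement logic for PROPDMGEXP and CROPDMGEXP.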

# replace "", +, -, and ? with 0 for PROPDMGEXP
pdmg <- subset$PROPDMGEXP
pdmg[pdmg == "" | pdmg == "+" | pdmg == "-" | pdmg == "?"] <- "0"

# replace letters with the corresponding exponents
# ("B" is renamed because "9" is not yet a level; "3", "2", and "6" already are)
levels(pdmg)[levels(pdmg)=="B"] <- "9"
pdmg[pdmg == "K"] <- "3"
pdmg[pdmg == "H" | pdmg == "h"] <- "2"
pdmg[pdmg == "M" | pdmg == "m" ] <- "6"

# assign back to PROPDMGEXP as a numeric variable
subset$PROPDMGEXP <- as.numeric(as.character(pdmg))

# replace "", +, -, and ? with 0 for CROPDMGEXP
cdmg <- subset$CROPDMGEXP
cdmg[cdmg == "" | cdmg == "+" | cdmg == "-" | cdmg == "?"] <- "0"

# replace letters with the corresponding exponents
levels(cdmg)[levels(cdmg)=="B"] <- "9"
levels(cdmg)[levels(cdmg)=="K"] <- "3"
levels(cdmg)[levels(cdmg)=="M"] <- "6"
cdmg[cdmg == "k"] <- "3"
cdmg[cdmg == "m"] <- "6"

# assign back to CROPDMGEXP as a numeric variable
subset$CROPDMGEXP <- as.numeric(as.character(cdmg))

The EVTYPE factor variable has 985 levels. The following procedures clean up this categorical data.

# convert EVTYPE to uppercase letters, trim and remove extra space
if(!"stringr" %in% rownames(installed.packages())){
    install.packages("stringr")
    }
library(stringr)
et <- subset$EVTYPE
et <- str_trim(toupper(et))
et <- gsub("  ", " ", et)

# remove the last slash, backslash, or parenthesis in each label
# (the text on either side of it is kept)
et <- gsub("(^.+) *[\\\\/\\(\\)]", "\\1", et)

# remove all number suffix
et <- gsub(" *[A-Z]?[0-9]+.*$", "", et)

# remove "S" from plural form and change -ies to -y
et <- gsub("IES$", "Y", et)
et <- gsub("(.*[^SI])[S]$", "\\1", et)

# remove any remaining symbols
et <- gsub(" ?[^A-Z -] *.*", "", et)
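
To make the effect of these steps concrete, the sketch below chains them in a hypothetical helper, clean_labels, and applies it to a few sample labels. Base trimws() stands in for stringr's str_trim() here only to keep the example free of dependencies; the regular expressions are copied from the steps above.

```r
# Hypothetical helper that chains the cleaning steps above;
# base trimws() stands in for stringr::str_trim() to avoid a dependency
clean_labels <- function(et) {
    et <- trimws(toupper(et))
    et <- gsub("  ", " ", et)
    et <- gsub("(^.+) *[\\\\/\\(\\)]", "\\1", et)   # drops the last \ / ( or )
    et <- gsub(" *[A-Z]?[0-9]+.*$", "", et)         # cuts from the first number onward
    et <- gsub("IES$", "Y", et)                     # -IES becomes -Y
    et <- gsub("(.*[^SI])[S]$", "\\1", et)          # drops a plural S
    gsub(" ?[^A-Z -] *.*", "", et)                  # cuts from any leftover symbol onward
}

samples <- c("   HIGH SURF ADVISORY", "Thunderstorm Winds", "TSTM WIND (G45)",
             "TSTM WIND/HAIL", "SNOW FLURRIES")
cleaned <- clean_labels(samples)
print(data.frame(before = samples, after = cleaned))
```

With these samples, "TSTM WIND (G45)" reduces to "TSTM WIND", while "TSTM WIND/HAIL" becomes "TSTM WINDHAIL" because only the slash itself is removed and the text on both sides is joined.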

The next step in cleaning up the categorical data is combining similar terms. Here, I am using the 48 Storm Data event types (Section 2.1.1 of the Storm Data Documentation) as a guide.

Note: This is by no means perfect and incorporates some of my personal interpretation and decisions.

# avalanche
et[grep("AVAL", et)] <- "AVALANCHE"

# blizzard
et[grep("BLIZZ", et)] <- "BLIZZARD"

# coastal flood
et[grep("COAST", et)] <- "COASTAL FLOOD"

# cold/wind chill
et[grep("^(COLD|COOL|WIND)", et)] <- "COLD/WIND CHILL"

# drought
et[grep("^(DRY|DROU)", et)] <- "DROUGHT"

# DUST DEVIL
et[grep("DEV", et)] <- "DUST DEVIL"

# dust storm
et[grep("(DUST( ?)S)|DUST$", et)] <- "DUST STORM"

# extreme cold
et[grep("EX.*CO|REC.* CO|UN.* CO", et)] <- "EXTREME COLD/WIND CHILL"

# excessive heat
et[grep("EX.*HE|REC.* HE|UN.* [DHW]", et)] <- "EXCESSIVE HEAT"

# flash flood
et[grep("FLASH", et)] <- "FLASH FLOOD"

# flood
et[grep("^[^FLC].*FLOOD", et)] <- "FLOOD"

# freeze
et[grep("FREEZ", et)] <- "FROST/FREEZE"

# hail
et[grep("HAIL", et)] <- "HAIL"

# heavy rain
et[grep("HEAV.*RA|PREC|REC.*RA", et)] <- "HEAVY RAIN"

# heavy snow
et[grep("HEAV.*S|^SNO", et)] <- "HEAVY SNOW"

# high wind
et[grep("HIGH.*W", et)] <- "HEAVY WIND"

# hurricane
et[grep("HURR|TYPH", et)] <- "HURRICANE"

# ice storm
et[grep("^ICE", et)] <- "ICE STORM"

# lightning
et[grep("LIGH.*NG", et)] <- "LIGHTNING"

# thunderstorm wind
et[grep("THUN|^TSTM", et)] <- "THUNDERSTORM WIND"

# tornado
et[grep("TORN|^WATER", et)] <- "TORNADO"

# winter storm
et[grep("^WINT.*S", et)] <- "WINTER STORM"

# winter weather
et[grep("^WINT.*[MW]", et)] <- "WINTER WEATHER"

# reassign to EVTYPE
subset$EVTYPE <- et

# number of factors still in EVTYPE
length(unique(subset$EVTYPE))
## [1] 274

At this point, although EVTYPE still contains around 270 distinct values, the largest clusters, which are the most likely to account for the highest damage, have been defined. The data is now ready to be analyzed.
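
The merging rules above could also be kept in a single ordered table of patterns and labels, which makes the order of application explicit. The sketch below is an assumption-laden illustration using only a small excerpt of the rules; rules and merge_labels are hypothetical names and the sample labels are made up.

```r
# Ordered pattern -> label rules (a small excerpt of the rules above)
rules <- c("AVAL"        = "AVALANCHE",
           "HAIL"        = "HAIL",
           "THUN|^TSTM"  = "THUNDERSTORM WIND",
           "TORN|^WATER" = "TORNADO")

# Hypothetical helper: apply each rule in turn, so later rules
# see the labels already rewritten by earlier ones
merge_labels <- function(labels, rules) {
    for (pattern in names(rules)) {
        labels[grepl(pattern, labels)] <- rules[[pattern]]
    }
    labels
}

merged <- merge_labels(c("TSTM WIND", "WATERSPOUT", "SMALL HAIL", "FOG"), rules)
print(merged)
```

Because the rules run in sequence, their order matters: a label rewritten by an early rule can no longer match a later pattern that targeted its original text.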

Data Analysis

The data analysis will focus on the following two questions:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this question, we look at the combination of the FATALITIES and INJURIES data to build a complete picture of the human-related damage these weather events cause.

We will add a new column called “humanDamage” to the subset data frame to capture the sum of the FATALITIES and INJURIES variables. We can then compute the total human damage for each type of weather event.

# create new column "humanDamage"
subset$humanDamage <- subset$FATALITIES + subset$INJURIES

# calculate total damage across different events and record in decreasing order
totHumanDamage <- sort(tapply(subset$humanDamage, subset$EVTYPE, sum), 
                       decreasing = T)
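
As a quick check of how this aggregation behaves, here is the same pattern on a toy data frame. The rows are made up purely for illustration, and toy and totToy are hypothetical names.

```r
# Toy events (hypothetical values, not taken from the storm data)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL", "TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0, 2, 3),
                  INJURIES = c(10, 1, 5, 0))

# Same metric as above: fatalities plus injuries per event
toy$humanDamage <- toy$FATALITIES + toy$INJURIES

# Sum per event type, largest first
totToy <- sort(tapply(toy$humanDamage, toy$EVTYPE, sum), decreasing = TRUE)
print(totToy)
```

Here the two tornado rows add up to 18, ahead of flood at 3 and hail at 1, so the first elements of the sorted vector are the top categories.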

2. Across the United States, which types of events have the greatest economic consequences?

Similarly, to answer this question, we first have to calculate the total economic damage that each type of weather event causes. We will create a new column called “economicDamage” using the PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP variables, where each damage figure is multiplied by 10 raised to its exponent.

# create new column "economicDamage"
subset$economicDamage <- subset$PROPDMG*10^subset$PROPDMGEXP + 
    subset$CROPDMG*10^subset$CROPDMGEXP

# calculate total damage across different events and record in decreasing order
totEconomicDamage <- sort(tapply(subset$economicDamage, subset$EVTYPE, sum), 
                          decreasing = T)
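
The formula can be illustrated on a few made-up rows. The values below are hypothetical and chosen only to show how the exponents scale the damage figures; toy is an assumed name.

```r
# Toy rows (hypothetical values): damage figure plus power-of-ten exponent
toy <- data.frame(PROPDMG = c(25, 2.5, 1),
                  PROPDMGEXP = c(3, 6, 9),
                  CROPDMG = c(0, 10, 0),
                  CROPDMGEXP = c(0, 3, 0))

# Same formula as above: each figure times 10 raised to its exponent
toy$economicDamage <- toy$PROPDMG * 10^toy$PROPDMGEXP + 
    toy$CROPDMG * 10^toy$CROPDMGEXP

print(format(toy$economicDamage, big.mark = ",", scientific = FALSE))
```

The first row is 25 x 10^3 = 25,000 dollars, the second is 2.5 x 10^6 + 10 x 10^3 = 2,510,000 dollars, and the third is 1 x 10^9 dollars. A crop exponent of zero leaves a crop figure of zero unchanged.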

Results

Total Human Damage

# plot totHumanDamage
library(lattice)
barchart(totHumanDamage[10:1], xlab = "Fatalities + Injuries (person)", 
         col = "steelblue",
         main = "Top 10 Weather Event Types with the Highest Human Damage", 
         xlim = c(0, max(totHumanDamage)*1.2),
         panel = function(...){
             args <- list(...)
             panel.text(args$x, args$y, format(args$x, big.mark = ",", 
                                               scientific = FALSE), 
                        pos = 4, offset = 1, cex = 0.75)
             panel.barchart(...)
             })

As we can see from above, the 10 weather event types that cause the most human-related damage, namely injuries and fatalities, from 1950 to 2011 are as follows:

  1. Tornado

  2. Excessive Heat

  3. Thunderstorm Wind

  4. Flood

  5. Lightning

  6. Heat

  7. Flash Flood

  8. Winter Weather

  9. Ice Storm

  10. Heavy Wind

Of these, Tornado is by far the highest at 97,100 persons killed or injured, more than the other nine event types combined. Heat-related incidents caused about 14,500 deaths and injuries. Thunderstorms, lightning, and heavy winds affected around 15,000 lives. The remaining places among the deadliest categories are taken by floods and winter weather/ice.

Total Economic Damage

# plot totEconomicDamage
barchart(totEconomicDamage[10:1], xlab = "Property Damage + Crop Damage ($)", 
         col = "red",
         main = "Top 10 Weather Event Types with the Highest Economic Damage",
         xlim = c(0, max(totEconomicDamage)*1.4),
         panel = function(...){
             args <- list(...)
             panel.text(args$x, args$y, paste("$", 
                                              format(args$x, big.mark = ",", 
                                                     scientific = FALSE), 
                                              sep = ""), 
                        pos = 4, offset = 1, cex = 0.75)
             panel.barchart(...)
             })

As we can see from above, the 10 weather event types that cause the most economic damage, namely property and crop damage, from 1950 to 2011 are as follows:

  1. Flood

  2. Hurricane

  3. Tornado

  4. Storm Surge

  5. Hail

  6. Flash Flood

  7. Drought

  8. Ice Storm

  9. Tropical Storm

  10. Heavy Wind

Of these, Flood is by far the highest category at $161 Billion in total damage. It should be noted that Flash Flood, a related category that is tracked separately, caused an additional $19 Billion in damage. Hurricane comes in second at $91 Billion in total damage. As a related item, Storm Surge, which is commonly associated with hurricanes and tropical storms, ranks fourth on the list with $43 Billion in damage. The third highest event type is Tornado, which, in addition to causing the most human-related damage, also caused $59 Billion in economic damage. The rest of the top categories are taken by hail, tropical storms, ice storms, drought, and heavy winds.