Goal of the Analysis

This analysis explores the NOAA Storm Database to determine the consequences of different types of severe weather events across the United States, specifically the economic and public health damage caused by specific event types.

Synopsis

This data analysis seeks to determine the most damaging weather events in the United States since 1950 using the NOAA Storm Database. This database contains data pertaining to the date/time, type, location, and severity of events. Much of this data is not directly pertinent to the goals of this analysis; specifically, this analysis seeks to determine the economic and public health consequences of EVTYPES. In an appendix following the main analysis, an additional preliminary analysis attempts to handle some of the shortcomings of the data by grouping similar EVTYPES into like “terms”, which are then evaluated by location and severity.

Data Processing

I opted to read in all of the data from the data set initially to allow for analysis of different subcategories of the EVTYPE variable. With the entire data set loaded, it is possible to analyze EVTYPE by location, severity, date, season, etc. To speed up loading, I supplied colClasses to the read.table function (which decompresses the .bz2 file automatically). After this, the only transformation necessary at this initial phase is converting the date columns to Date vectors.

strm_file <- "storm_data.csv.bz2"

if(!file.exists(strm_file) || !file.exists("storm_documentation.pdf")){
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = strm_file)
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf", "storm_documentation.pdf")
}

classes <- c("numeric", rep("character", 3), 
             "numeric", rep("character", 3), 
             "numeric", rep("character", 4), 
             "numeric", "character", "numeric",
             "character","character", "numeric",
             "numeric", "character", 
             rep("numeric", 4), "character", "numeric", 
             rep("character", 4), rep("numeric", 4), 
             "character", "numeric")
strm <- read.table(strm_file, colClasses = classes, header = T, sep=",", strip.white = T, encoding = "latin1")

# convert the begin/end date columns from character to Date
strm$BGN_DATE <- as.Date(strm$BGN_DATE, "%m/%d/%Y")
strm$END_DATE <- as.Date(strm$END_DATE, "%m/%d/%Y")

Though the initial data set is mostly clean at this point, there are still lowercase values mixed in with the uppercase ones. To handle this, I wrote the function fix.chars() to find all columns of class “character” and transform them to upper case. This will make it easier to filter values in the subsets made below.

fix.chars <- function(df){
    # indices of the character columns; assigned globally (<<-) because the
    # subsetting code below reuses `chars` to know which columns to replace
    chars <<- which(sapply(df, is.character))
    c_cols <- df[,chars]
    # convert every character column to upper case and return the result
    sapply(c_cols, toupper)
}

Weather Events Most Harmful to Population Health

In order to answer which weather events are most harmful to population health, it is necessary to subset the data pertinent to this question. The subset is then cleaned using the fix.chars() function created above.

fat <- subset(strm, FATALITIES > 0 | INJURIES > 0) # only 0.77% of records
fat[,chars] <- fix.chars(fat)
fat <- data.frame(STATE = fat$STATE, COUNTYNAME=fat$COUNTYNAME,
                  EVTYPE=fat$EVTYPE, INJURIES=fat$INJURIES, 
                  FATALITIES=fat$FATALITIES, TOTDMG=fat$INJURIES+fat$FATALITIES)
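
The figure in the comment above can be checked directly against the full data set:

# percentage of records with at least one fatality or injury
round(100 * nrow(fat) / nrow(strm), 2)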

First, let’s look at the top 5 most deadly EVTYPES, then those that cause the most injuries:

dmg <- tapply(fat$FATALITIES, fat$EVTYPE, sum)
head(sort(dmg, decreasing = T), 5)
##        TORNADO EXCESSIVE HEAT    FLASH FLOOD           HEAT      LIGHTNING 
##           5633           1903            978            937            816
dmg <- tapply(fat$INJURIES, fat$EVTYPE, sum)
head(sort(dmg, decreasing = T), 5)
##        TORNADO      TSTM WIND          FLOOD EXCESSIVE HEAT      LIGHTNING 
##          91346           6957           6789           6525           5230

There is quite a bit of overlap. Next, let’s plot total injuries + fatalities (on a log scale) to see which event type has the highest sum and median of damage to public health:

dmg_fat <- tapply(fat$TOTDMG, fat$EVTYPE, sum)
head(sort(dmg_fat, decreasing = T), 5)
##        TORNADO EXCESSIVE HEAT      TSTM WIND          FLOOD      LIGHTNING 
##          96979           8428           7461           7259           6046
library(ggplot2)
top_5_dmg <- subset(fat, EVTYPE %in% c(names(head(sort(dmg_fat, decreasing = T), 5))))
pl <- ggplot(data=top_5_dmg, 
             aes(x=EVTYPE,
                 y=log(TOTDMG)))
pl + geom_boxplot(aes(fill=EVTYPE))+scale_fill_brewer()


Of the top 5 EVTYPES most damaging to public health, TORNADO has the highest median, the highest sum, and the largest outliers. This suggests that tornados are the most damaging to public health.
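
As a quick numeric check of the median claim, the per-type medians can be computed directly (a small sketch; as.character() drops the unused factor levels that tapply would otherwise report as NA):

# median public health damage per event type among the top 5
sort(tapply(top_5_dmg$TOTDMG, as.character(top_5_dmg$EVTYPE), median), decreasing = TRUE)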

Problem with EVTYPE analysis

There is a problem with the data, namely, that there is a high degree of synonymy in the levels of EVTYPE. For example:

head(levels(fat$EVTYPE))
## [1] "AVALANCE"     "AVALANCHE"    "BLACK ICE"    "BLIZZARD"    
## [5] "BLOWING SNOW" "BRUSH FIRE"

This shows that there are clearly related items among the levels of EVTYPE (including clerical errors like “AVALANCE” for “AVALANCHE”) which distort the data. This is a difficult problem to handle. An initial attempt to find a better estimate of the most damaging type of weather event can be found in the appendix to this report, Handling Synonymy of EVTYPES.
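
As a small illustration of the idea developed in the appendix, fuzzy matching can catch such clerical errors; agrep() with an allowance of one edit lumps both spellings together:

# both "AVALANCE" and "AVALANCHE" are within one edit of the correct spelling
agrep("AVALANCHE", levels(fat$EVTYPE), max.distance = 1, value = TRUE)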

Weather Events with Highest Economic Consequences

First, it is necessary to scale the damage values according to the magnitude codes found in the …EXP columns (K = thousands, M = millions, B = billions).

ind <- strm$PROPDMGEXP == "M"| strm$PROPDMGEXP == "m"
strm$PROPDMG[ind] <- strm$PROPDMG[ind]*1e+06
ind <- strm$PROPDMGEXP == "K"| strm$PROPDMGEXP == "k"
strm$PROPDMG[ind] <- strm$PROPDMG[ind]*1000
ind <- strm$PROPDMGEXP == "B"|strm$PROPDMGEXP == "b"
strm$PROPDMG[ind] <- strm$PROPDMG[ind]*1e+09

ind <- strm$CROPDMGEXP == "M"| strm$CROPDMGEXP =="m"
strm$CROPDMG[ind] <- strm$CROPDMG[ind]*1e+06
ind <- strm$CROPDMGEXP == "K"|strm$CROPDMGEXP =="k"
strm$CROPDMG[ind] <- strm$CROPDMG[ind]*1000
ind <- strm$CROPDMGEXP == "B"| strm$CROPDMGEXP =="b"
strm$CROPDMG[ind] <- strm$CROPDMG[ind]*1e+09

The following subsets the original storm data so that only records with positive PROPDMG or CROPDMG remain, transforms the character vectors to uppercase, and keeps only the pertinent columns in the resulting data frame.

econ <- subset(strm, PROPDMG > 0 | CROPDMG > 0)
econ[,chars] <- fix.chars(econ)
econ <- data.frame(STATE = econ$STATE, COUNTYNAME=econ$COUNTYNAME,
                   EVTYPE=econ$EVTYPE, PROPDMG = econ$PROPDMG, 
                   PROPDMGEXP = econ$PROPDMGEXP, CROPDMG = econ$CROPDMG, 
                   CROPDMGEXP = econ$CROPDMGEXP, TOTDMG = econ$PROPDMG+econ$CROPDMG)

In order to find the top 5 event types for total damage since 1950, I used tapply to sum property and crop damage (in billions of dollars) for each event type. Note the following:

dmg_econ <- tapply((econ$TOTDMG)/1e+9, econ$EVTYPE, sum)
head(sort(dmg_econ, decreasing = T), 5)
##             FLOOD HURRICANE/TYPHOON           TORNADO       STORM SURGE 
##         150.31968          71.91371          57.35211          43.32354 
##              HAIL 
##          18.75822

These values show that FLOOD has caused the most total damage to crops and property in the data set. The following plot shows the total damage for each of the top five most damaging weather events:

library(ggplot2)
top_5_dmg <- subset(econ, EVTYPE %in% c(names(head(sort(dmg_econ, decreasing = T), 5))))
pl <- ggplot(data=top_5_dmg, 
             aes(x=EVTYPE,
                 y=TOTDMG))
pl + geom_bar(stat="identity", aes(fill=EVTYPE))

With all of that in mind, these values, like the injuries and fatalities data, can be misleading because of synonymous events. In this case, the similarity between EVTYPES is evident in FLOOD and FLASH FLOOD, which are clearly related. If those similar types of flooding were lumped into one category, the lead of flooding over the other event types would likely be even larger. In the following section, we will make an initial attempt to handle the synonymy of the EVTYPES.
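
A rough sense of that combined effect can be had by summing every event type whose label contains “FLOOD”; this sketch assumes all such labels genuinely describe flooding:

# combined damage (in billions) across all EVTYPE labels containing "FLOOD"
sum(dmg_econ[grep("FLOOD", names(dmg_econ))])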

Results

Based on this initial analysis of the NOAA Storm Data set, tornados appear to have the greatest effect on public health, while floods have caused the greatest economic damage in the United States over the past 60+ years. Respectively, they account for the most combined injuries and fatalities, and the most combined property and crop damage.

# Public Health Damage
head(sort(dmg_fat, decreasing = T), 5)
##        TORNADO EXCESSIVE HEAT      TSTM WIND          FLOOD      LIGHTNING 
##          96979           8428           7461           7259           6046
# Economic Damage (in billions of dollars)
head(sort(dmg_econ, decreasing = T), 5)
##             FLOOD HURRICANE/TYPHOON           TORNADO       STORM SURGE 
##         150.31968          71.91371          57.35211          43.32354 
##              HAIL 
##          18.75822

Though this seems fairly clear from the data, there are major deficiencies in the NOAA Storm Data set because similarly labelled terms are not grouped together. Many EVTYPES pertaining to heat or flooding may actually correspond to the same sort of events, which means that, in reality, some other EVTYPE could have a higher total. In the following appendix, I will seek to determine whether the data looks similar when EVTYPES are grouped together by type.

Appendix: Handling Synonymy of EVTYPES

In this section, we will attempt to group EVTYPE values together according to the similarity of the type. For example, “HEAT”, “EXCESSIVE HEAT”, and “HEAT/DROUGHT” are all clearly related. Aggregating these terms may give a much clearer picture of which type of event is actually most harmful.
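
For instance, the heat-related labels alone can be listed with a simple pattern match:

# all EVTYPE labels mentioning heat
grep("HEAT", levels(fat$EVTYPE), value = TRUE)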

First, see that 205 unique EVTYPES are present in the total fatalities and injuries data frame.

length(tapply(fat$TOTDMG, fat$EVTYPE, sum))
## [1] 205

We already determined that some of these are similar. Now, we must group them according to type:

library(RTextTools)
library(tm)
tm <- create_matrix(fat$EVTYPE, removeSparseTerms = 0.999, 
                    removeStopwords = TRUE, stemWords = TRUE,
                    minWordLength = 4 )    

terms <- Terms(tm)
terms
##  [1] "avalanch"         "blizzard"         "chill"           
##  [4] "cold"             "coldwind"         "current"         
##  [7] "dens"             "dust"             "excess"          
## [10] "extrem"           "fire"             "flash"           
## [13] "flood"            "hail"             "heat"            
## [16] "heavi"            "high"             "hurrican"        
## [19] "hurricanetyphoon" "lightn"           "marin"           
## [22] "rain"             "snow"             "storm"           
## [25] "stream"           "strong"           "surf"            
## [28] "surfhigh"         "thunderstorm"     "tornado"         
## [31] "tropic"           "tstm"             "urbansml"        
## [34] "wave"             "weather"          "weathermix"      
## [37] "wildfir"          "wildforest"       "wind"            
## [40] "windhail"         "winter"

Clearly, there will be issues with this list, since many of the terms can apply to multiple categories, such as “excess” and “extrem”. In order to handle this, trouble words can be deleted from the original data frame to give a more accurate aggregation of like terms:

# mgsub: apply several gsub() pattern/replacement pairs in sequence
mgsub <- function(pattern, replacement, x, ...) {
    if (length(pattern) != length(replacement)) {
        stop("pattern and replacement do not have the same length.")
    }
    result <- x
    for (i in seq_along(pattern)) {
        result <- gsub(pattern[i], replacement[i], result, ...)
    }
    result
}

Now to aggregate the terms with the trouble words removed.

fat$EVTYPE <- mgsub(c("high", "record", "heavy", "flash", "excessive", 
                      "weather", "extreme", "wave", "wild", "dense",
                      "road", "strong"), 
                    rep("",12), fat$EVTYPE, ignore.case=T)
tm <- create_matrix(fat$EVTYPE, removeSparseTerms = 0.999, 
                    removeStopwords = TRUE, stemWords = TRUE,
                    minWordLength = 4 )    

terms <- Terms(tm)
terms
##  [1] "avalanch"         "blizzard"         "chill"           
##  [4] "cold"             "coldwind"         "current"         
##  [7] "dust"             "fire"             "flood"           
## [10] "forest"           "hail"             "heat"            
## [13] "hurrican"         "hurricanetyphoon" "lightn"          
## [16] "marin"            "rain"             "snow"            
## [19] "storm"            "stream"           "surf"            
## [22] "thunderstorm"     "tornado"          "tropic"          
## [25] "tstm"             "urbansml"         "wind"            
## [28] "windhail"         "winter"

Next, it is necessary to place each of the EVTYPES in the fatalities/injuries subset into the category of the closest corresponding ‘term’:

# this groups all of the tm words into term-groups 
group.terms <- function(x){
    grouped_terms <- c()
    for (i in terms){
        grouped_terms[i] <- list(agrep(i, x$EVTYPE, max.distance = .001, cost=2, ignore.case = T))
    }
    return(grouped_terms)
}
# usage: group.terms(inj), gets rows corresponding with each term-group

# extracts the given value column (e.g. "fatalities") from the data frame for the rows in a term-group
groupings <- function(term.group, value, df){
    value <- toupper(value)
    c(df[[value]][term.group])
}

fat_terms <- group.terms(fat)
fat_by_term <- lapply(fat_terms, groupings, value="totdmg", df=fat)
sort(unlist(lapply(fat_by_term, sum)), decreasing = TRUE)
##          tornado             wind             heat            flood 
##            97068            15197            12362            10129 
##             tstm            storm           lightn     thunderstorm 
##             7609             7326             6052             2690 
##           winter             hail             fire         hurrican 
##             2247             1790             1748             1463 
## hurricanetyphoon             snow          current         blizzard 
##             1339             1335             1106              907 
##             cold           forest             rain             dust 
##              771              557              532              525 
##             surf           tropic         avalanch            chill 
##              487              449              396              278 
##         coldwind           stream         urbansml            marin 
##              257              108              107              106 
##         windhail 
##              100

Problems with using Levenshtein distance for Language Processing

Though we’re clearly able to get a rough approximation of which terms belong together, there are obvious limitations with the create_matrix() and agrep() functions, which both depend heavily on the Levenshtein (edit) distance to determine synonymy. For example, “record heat” is nearly as close to “cold” as “record cold” is:

library(stringdist)
stringdist("cold", c("record cold", "record heat", "heat"), method = "lv")
## [1] 7 8 4
stringdist("heat", c("record cold", "record heat", "heat"), method = "lv")
## [1] 10  7  0
# spread (sd) of edit distances within the tornado group, measured against a
# fixed reference string (terms[1]); a small sd means the group is homogeneous
sd(stringdist(terms[1], fat$EVTYPE[fat_terms$tornado], method = "lv"))
## [1] 0.283046
# the same measure for the heat group
sd(stringdist(terms[1], fat$EVTYPE[fat_terms$heat], method = "lv"))
## [1] 0.2307553
# and for the wind group, which is far more heterogeneous
sd(stringdist(terms[1], fat$EVTYPE[fat_terms$wind], method = "lv"))
## [1] 3.581688

Notice that “record heat” and “record cold” are quite similar in distance to “cold”, so both would be lumped into the “cold” group, while “heat” would also absorb “record heat”. The result is double counting and a misestimation of the synonymy for the “cold” category. Yet the standard deviation of the edit distance in the “tornado” category is quite small, so the “tornado” grouping can be considered quite accurate. This means that, though our approximation is rough at best, it is still enough to point out the general trend: tornados have been the most harmful to population health over the past 60+ years in the United States. Not only are they the most likely to cause fatalities and injuries, they are also the most widespread in their reach.
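
That double counting can be quantified directly: the term groups are lists of row indices, so their pairwise intersections reveal rows assigned to more than one group. A small sketch:

# rows that fall into both the "heat" and "cold" term groups
length(intersect(fat_terms$heat, fat_terms$cold))
# total assignments minus unique rows covered = rows counted more than once
sum(lengths(fat_terms)) - length(unique(unlist(fat_terms)))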

Choropleth Map

A helpful figure to illustrate the widespread effect of tornados on population health is a county-level choropleth map of tornado fatalities.

#### GET FIPS  ####
fips <- read.table(url("http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt"), sep = ",", quote = "", colClasses = c("character", "integer", rep("character", 3)))

# build numeric county FIPS codes (state code followed by county code);
# paste() is vectorized, so no loop is needed
fips$fips <- as.numeric(paste(fips$V2, fips$V3, sep = ""))

# normalize county names: uppercase, drop the word "county" and the
# trailing character left behind
fips$V4 <- toupper(fips$V4)
fips$V4 <- gsub("county", "", fips$V4, ignore.case = T)
fips$V4 <- gsub(".$", "", fips$V4, ignore.case = T)

# "COUNTYNAME, ST" keys for matching against the storm data
fips$names <- paste(fips$V4, fips$V1, sep = ", ")

#### FORMATTING FOR CHOROPLETH MAP ####
fat$names <- apply(fat[,c("COUNTYNAME", "STATE")], 1, paste, collapse=", ")

# match each "COUNTYNAME, ST" key against the census table to recover the
# numeric FIPS code; match() returns NA where no exact key exists
fat$region <- fips$fips[match(fat$names, fips$names)]
fat$value <- fat$FATALITIES

#### GRAPH CHOROPLETH MAP ####

library(choroplethr)
library(ggplot2)
library(viridis)
library(choroplethrMaps)
choro_map <- data.frame(region=fat[fat_terms$tornado,]$region, value=fat[fat_terms$tornado,]$value)
# sum fatalities within each county (aggregate's formula interface keeps the
# column names region/value and drops NA regions)
choro_map <- aggregate(value ~ region, data = choro_map, FUN = sum)

choro = CountyChoropleth$new(choro_map)
choro$title = "Total Tornado Fatalities"
choro$set_num_colors(1)
choro$ggplot_polygon = geom_polygon(aes(fill = value), color = NA)
choro$ggplot_scale = scale_fill_gradientn(name = "Fatalities", colours = viridis(32), limits = c(0, 30))
choro$render()