Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

What I discovered is that Floods are the most deadly and costly in respect to population health and economic consequences. Tornadoes, Extreme Heat, Extreme
Cold and Hurricanes also lead to an increase in fatalities, injuries, property damage and crop damage. This is especially relevant because currently (2021) countries around the globe are experiencing these most deadly weather events. We are seeing loss of life, injuries, and billions/trillions in property damage. Analyzing historical data on weather events could help us determine the areas where resources are needed most to prevent both loss of life and property damage.

Data Processing

Loading Storm Data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(stringdist)
library(stringr)

#Uncomment for download.
#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile="./repdata_data_StormData.csv.bs2",method ="curl")

if(!file.exists("./repdata_data_StormData.csv")){
  unzip(zipfile="./repdata_data_StormData.csv.bs2")
}
stormData <- read.csv("repdata_data_StormData.csv")
#summary(stormData)

Formating and Filtering Data

Format BGN_DATE column as a Date. Filter the data for date ‘/1/1/96’ and beyond because it contains less incomplete data.

#I am selecting only the dates and data that is complete. 
stormData$BGN_DATE <- as.POSIXct(stormData$BGN_DATE, format="%m/%d/%Y %H:%M:%S")
stormData <- stormData %>% filter(BGN_DATE >= as.POSIXct('1/1/1996', format="%m/%d/%Y")) 

Formatting of Property Damage and Crop Damage and Other Variables

Four columns in the database: PROPDMG and PROPDMGEXP, and CROPDMG and CROPDMGEXP represent the Total Property and Crop Damage.

The *EXP columns represent the multiplier for the base dollar amount, and can take on one of three values in the filtered data set: K (Thousands), M (Millions), or B (Billions) based on US dollars.

These values were converted to the appropriate multiplier:
(1E3 for thousands, 1E6 for millions, or 1E9 for billions)

Two new columns, PROPDMG_AMT, and CROPDMG_AMT were generated by
multiplying the raw dollar amount by the multipler generated by the mult
function defined below:

mult <<- function(x) { if(x == "K") 1E3  
                       else if(x == "M") 1E6  
                       else if(x == "B") 1E9  
                       else 0}

#Two new columns are created PROPDMG_COST and CROPDMG_COST
#EVTYPE is changed to uppercase for clustering
stormData <-  mutate(stormData,  
                   EVTYPE = str_trim(toupper(EVTYPE)),
                   PROPDMGEXP = PROPDMG*sapply(PROPDMGEXP, FUN = mult),
                   CROPDMGEXP = CROPDMG*sapply(CROPDMGEXP, FUN = mult))


# Create new fields to hold the Dollar Amounts  for Property and Crop Damage
stormData <- stormData %>% 
  mutate(PROPDMG_AMT = PROPDMG*PROPDMGEXP, CROPDMG_AMT = CROPDMG*CROPDMGEXP) %>%     arrange(BGN_DATE)
  
  
#explore filtered data
stormData <- stormData %>% 
  select(BGN_DATE, COUNTY, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, PROPDMG_AMT, CROPDMG_AMT)
summary(stormData)
##     BGN_DATE                       COUNTY          STATE          
##  Min.   :1996-01-01 00:00:00   Min.   :  1.00   Length:653530     
##  1st Qu.:2000-11-21 00:00:00   1st Qu.: 29.00   Class :character  
##  Median :2005-05-14 00:00:00   Median : 73.00   Mode  :character  
##  Mean   :2004-10-25 14:47:18   Mean   : 99.89                     
##  3rd Qu.:2008-08-22 00:00:00   3rd Qu.:129.00                     
##  Max.   :2011-11-30 00:00:00   Max.   :873.00                     
##     EVTYPE            FATALITIES           INJURIES           PROPDMG       
##  Length:653530      Min.   :  0.00000   Min.   :0.00e+00   Min.   :   0.00  
##  Class :character   1st Qu.:  0.00000   1st Qu.:0.00e+00   1st Qu.:   0.00  
##  Mode  :character   Median :  0.00000   Median :0.00e+00   Median :   0.00  
##                     Mean   :  0.01336   Mean   :8.87e-02   Mean   :  11.69  
##                     3rd Qu.:  0.00000   3rd Qu.:0.00e+00   3rd Qu.:   1.00  
##                     Max.   :158.00000   Max.   :1.15e+03   Max.   :5000.00  
##    PROPDMGEXP           CROPDMG          CROPDMGEXP         PROPDMG_AMT       
##  Min.   :0.000e+00   Min.   :  0.000   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:0.000e+00   1st Qu.:  0.000   1st Qu.:0.000e+00   1st Qu.:0.000e+00  
##  Median :0.000e+00   Median :  0.000   Median :0.000e+00   Median :0.000e+00  
##  Mean   :5.612e+05   Mean   :  1.839   Mean   :5.318e+04   Mean   :5.741e+07  
##  3rd Qu.:1.250e+03   3rd Qu.:  0.000   3rd Qu.:0.000e+00   3rd Qu.:1.562e+03  
##  Max.   :1.150e+11   Max.   :990.000   Max.   :1.510e+09   Max.   :1.322e+13  
##   CROPDMG_AMT       
##  Min.   :0.000e+00  
##  1st Qu.:0.000e+00  
##  Median :0.000e+00  
##  Mean   :9.887e+06  
##  3rd Qu.:0.000e+00  
##  Max.   :3.552e+11

Determine Values for EVTYPE

Event Type(EVTYPE) is not “tidy data”. It contains many variations of the 48 Event Types defined in the documentation. I will use several methods to trim these variations down to a managable size by clustering multiple instances of the same EVTYPE based on similarites.

  1. I changed the EVTYPE values to uppercase and trimmed the front and end.
  2. I will limit it to the unique events.
  3. I will then apply Hiearchial Clustering using stringdist functions.
EVTYPES <- unique(stormData$EVTYPE)
print(paste("There are",length(EVTYPES),"unique Event Types."))
## [1] "There are 430 unique Event Types."

After looking over a few RPubs based on this dataset, I studied a couple of tutorials on K-Means clustering and Hiearchial Clustering. This is what I found: The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo’s or misspellings, the Jaro-Winkler distance may be of use. Our data column EVTYPE seems to have this problem, as there are many variations of the 48 events defined, but after running unique(EVTYPE) there are 985 on the filtered data alone. This alone was a large number of events to analyze in a single report. So, I decided to explore the use stringdist package and the Jaro-Winkler distance metric to create a distance matrix. The string distance is normalized, so a single metric will work for strings of various lengths such as we have in our data set. The function stringdistmatrix (part of the stringdist package) will use: The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d - lpd, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a ‘prefix’ factor, which in the work of Winkler is often chosen 0.1.

distance_matrix <- stringdistmatrix(EVTYPES,EVTYPES,method = "jw",p=0.1)
# I then put the names back into the distance matrix to plot it. 
rownames(distance_matrix) <- EVTYPES
HC_EVTYPES <- hclust(as.dist(distance_matrix),method="ward.D2")


# I tried several cluster heights(0.12,0.14,0.16,etc.) and finally selected  
# clusters at a height of 0.16 in the hierarchical tree. However I later decided
# to try using k=48 to force the cluster size to 48, which was the initial defined
# number of Event Types
#EVCUT_016 <- cutree(HC_EVTYPES, h=0.16)
EVCUT_048 <- cutree(HC_EVTYPES, k=48)
EVCUT_048 <- as.data.frame(EVCUT_048)
EVCUT_048$Event_Type <- attr(EVCUT_048, 'row.names')
colnames(EVCUT_048) <- c("Cluster","Event_Type")

stormData <- merge(stormData, EVCUT_048, by.x="EVTYPE", by.y="Event_Type", all.x=T, all.y=F)

# clusterData will allow me to show info such as names and counts for a cluster  
clusterData <- stormData %>% group_by(EVTYPE, Cluster) %>% summarize(COUNT = n()) %>% 
   ungroup() %>% 
   arrange(Cluster)
## `summarise()` has grouped output by 'EVTYPE'. You can override using the `.groups` argument.
summary(clusterData)
##     EVTYPE             Cluster          COUNT       
##  Length:430         Min.   : 1.00   Min.   :     1  
##  Class :character   1st Qu.:11.00   1st Qu.:     1  
##  Mode  :character   Median :23.00   Median :     2  
##                     Mean   :21.49   Mean   :  1520  
##                     3rd Qu.:30.00   3rd Qu.:    12  
##                     Max.   :48.00   Max.   :207715
print(paste("There are now ",max(clusterData$Cluster)," Clusters."))
## [1] "There are now  48  Clusters."

Results

  1. Across the United States, which types of events (as indicated by the EVTYPE variable) are most harmful with respect to population health? Below is a bar-graph that shows the Top 10 most harmful events with respect to population health. Namely, Excessive Cold, Thundersnow Shower, and Floods.
# Prepare a summarized version of the Totals for FATALITIES, INJURIES, PROPDMG,
# and CROPDMG for plotting
stormData_summarized <- stormData %>% group_by(Cluster) %>% 
  summarize(EVTYPE = first(EVTYPE), COUNT = n(), FATALITIES = sum(FATALITIES, na.rm=T), 
            INJURIES = sum(INJURIES, na.rm=T), PROPDMG = sum(PROPDMG_AMT, na.rm=T), 
            CROPDMG = sum(CROPDMG_AMT, na.rm=T)) %>% ungroup()

# Reorder the data for the plotting 
stormData_Fatalities <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,FATALITIES)) %>% 
  arrange(desc(FATALITIES))

stormData_Injuries <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,INJURIES)) %>% 
  arrange(desc(INJURIES))

# Plot the most important data regarding the effects on the Public Health
ggplot(data=stormData_Fatalities[1:10,], aes(EVTYPE, FATALITIES)) +  
  geom_bar(stat = "identity",aes(fill = Cluster),width=0.7,
    position =  position_dodge(width = 1.0)) + coord_flip() +
    theme_bw() +
    labs(x="Event Types", y="Top 10 Fatal Storm Events")+
    labs(title="US Fatalities (1996-2011) per Storm Event")

ggplot(data=stormData_Injuries[1:10,], aes(EVTYPE, INJURIES)) +
  geom_bar(stat = "identity",aes(fill = Cluster),width=0.7,
    position = position_dodge(width = 1.0)) + coord_flip() +
    theme_bw() +
    labs(x="Event Types", y="Top 10 Injurious Storm Events") +
    labs(title="US Injuries (1996-2011) per Storm Event")

  1. Across the United States, which types of events have the greatest economic consequences? Below is a bar-graph showing the Top 10 Events that have the greatest economic consequences. Namely Excessive Cold, Floods and Thundersnow
    Showers.
stormData_Cropdmg <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,CROPDMG)) %>% 
  arrange(desc(CROPDMG))
stormData_Propdmg <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,PROPDMG)) %>% 
  arrange(desc(PROPDMG))

# Plot results on Property Damage
ggplot(data=stormData_Propdmg[1:10,], aes(EVTYPE, PROPDMG)) +
  geom_bar(stat = "identity",aes(fill = Cluster),width=0.7,
    position = position_dodge(width = 1.0)) + coord_flip() +
    theme_bw() +
    labs(x="Event Types", y="Top 10 Property Damage Storm Events") +
    labs(title="US Property Damage-1996-2011 from Storm Event")