Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
What I discovered is that Floods are the most deadly and costly in respect to population health and economic consequences. Tornadoes, Extreme Heat, Extreme
Cold and Hurricanes also lead to an increase in fatalities, injuries, property damage and crop damage. This is especially relevant because currently (2021) countries around the globe are experiencing these most deadly weather events. We are seeing loss of life, injuries, and billions/trillions in property damage. Analyzing historical data on weather events could help us determine the areas where resources are needed most to prevent both loss of life and property damage.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringdist)
library(stringr)
#Uncomment for download.
#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile="./repdata_data_StormData.csv.bs2",method ="curl")
if(!file.exists("./repdata_data_StormData.csv")){
unzip(zipfile="./repdata_data_StormData.csv.bs2")
}
stormData <- read.csv("repdata_data_StormData.csv")
#summary(stormData)
Format BGN_DATE column as a Date. Filter the data for date ‘/1/1/96’ and beyond because it contains less incomplete data.
#I am selecting only the dates and data that is complete.
stormData$BGN_DATE <- as.POSIXct(stormData$BGN_DATE, format="%m/%d/%Y %H:%M:%S")
stormData <- stormData %>% filter(BGN_DATE >= as.POSIXct('1/1/1996', format="%m/%d/%Y"))
Four columns in the database: PROPDMG and PROPDMGEXP, and CROPDMG and CROPDMGEXP represent the Total Property and Crop Damage.
The *EXP columns represent the multiplier for the base dollar amount, and can take on one of three values in the filtered data set: K (Thousands), M (Millions), or B (Billions) based on US dollars.
These values were converted to the appropriate multiplier:
(1E3 for thousands, 1E6 for millions, or 1E9 for billions)
Two new columns, PROPDMG_AMT, and CROPDMG_AMT were generated by
multiplying the raw dollar amount by the multipler generated by the mult
function defined below:
mult <<- function(x) { if(x == "K") 1E3
else if(x == "M") 1E6
else if(x == "B") 1E9
else 0}
#Two new columns are created PROPDMG_COST and CROPDMG_COST
#EVTYPE is changed to uppercase for clustering
stormData <- mutate(stormData,
EVTYPE = str_trim(toupper(EVTYPE)),
PROPDMGEXP = PROPDMG*sapply(PROPDMGEXP, FUN = mult),
CROPDMGEXP = CROPDMG*sapply(CROPDMGEXP, FUN = mult))
# Create new fields to hold the Dollar Amounts for Property and Crop Damage
stormData <- stormData %>%
mutate(PROPDMG_AMT = PROPDMG*PROPDMGEXP, CROPDMG_AMT = CROPDMG*CROPDMGEXP) %>% arrange(BGN_DATE)
#explore filtered data
stormData <- stormData %>%
select(BGN_DATE, COUNTY, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, PROPDMG_AMT, CROPDMG_AMT)
summary(stormData)
## BGN_DATE COUNTY STATE
## Min. :1996-01-01 00:00:00 Min. : 1.00 Length:653530
## 1st Qu.:2000-11-21 00:00:00 1st Qu.: 29.00 Class :character
## Median :2005-05-14 00:00:00 Median : 73.00 Mode :character
## Mean :2004-10-25 14:47:18 Mean : 99.89
## 3rd Qu.:2008-08-22 00:00:00 3rd Qu.:129.00
## Max. :2011-11-30 00:00:00 Max. :873.00
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:653530 Min. : 0.00000 Min. :0.00e+00 Min. : 0.00
## Class :character 1st Qu.: 0.00000 1st Qu.:0.00e+00 1st Qu.: 0.00
## Mode :character Median : 0.00000 Median :0.00e+00 Median : 0.00
## Mean : 0.01336 Mean :8.87e-02 Mean : 11.69
## 3rd Qu.: 0.00000 3rd Qu.:0.00e+00 3rd Qu.: 1.00
## Max. :158.00000 Max. :1.15e+03 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP PROPDMG_AMT
## Min. :0.000e+00 Min. : 0.000 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:0.000e+00 1st Qu.: 0.000 1st Qu.:0.000e+00 1st Qu.:0.000e+00
## Median :0.000e+00 Median : 0.000 Median :0.000e+00 Median :0.000e+00
## Mean :5.612e+05 Mean : 1.839 Mean :5.318e+04 Mean :5.741e+07
## 3rd Qu.:1.250e+03 3rd Qu.: 0.000 3rd Qu.:0.000e+00 3rd Qu.:1.562e+03
## Max. :1.150e+11 Max. :990.000 Max. :1.510e+09 Max. :1.322e+13
## CROPDMG_AMT
## Min. :0.000e+00
## 1st Qu.:0.000e+00
## Median :0.000e+00
## Mean :9.887e+06
## 3rd Qu.:0.000e+00
## Max. :3.552e+11
Event Type(EVTYPE) is not “tidy data”. It contains many variations of the 48 Event Types defined in the documentation. I will use several methods to trim these variations down to a managable size by clustering multiple instances of the same EVTYPE based on similarites.
EVTYPES <- unique(stormData$EVTYPE)
print(paste("There are",length(EVTYPES),"unique Event Types."))
## [1] "There are 430 unique Event Types."
After looking over a few RPubs based on this dataset, I studied a couple of tutorials on K-Means clustering and Hiearchial Clustering. This is what I found: The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo’s or misspellings, the Jaro-Winkler distance may be of use. Our data column EVTYPE seems to have this problem, as there are many variations of the 48 events defined, but after running unique(EVTYPE) there are 985 on the filtered data alone. This alone was a large number of events to analyze in a single report. So, I decided to explore the use stringdist package and the Jaro-Winkler distance metric to create a distance matrix. The string distance is normalized, so a single metric will work for strings of various lengths such as we have in our data set. The function stringdistmatrix (part of the stringdist package) will use: The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d - lpd, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a ‘prefix’ factor, which in the work of Winkler is often chosen 0.1.
distance_matrix <- stringdistmatrix(EVTYPES,EVTYPES,method = "jw",p=0.1)
# I then put the names back into the distance matrix to plot it.
rownames(distance_matrix) <- EVTYPES
HC_EVTYPES <- hclust(as.dist(distance_matrix),method="ward.D2")
# I tried several cluster heights(0.12,0.14,0.16,etc.) and finally selected
# clusters at a height of 0.16 in the hierarchical tree. However I later decided
# to try using k=48 to force the cluster size to 48, which was the initial defined
# number of Event Types
#EVCUT_016 <- cutree(HC_EVTYPES, h=0.16)
EVCUT_048 <- cutree(HC_EVTYPES, k=48)
EVCUT_048 <- as.data.frame(EVCUT_048)
EVCUT_048$Event_Type <- attr(EVCUT_048, 'row.names')
colnames(EVCUT_048) <- c("Cluster","Event_Type")
stormData <- merge(stormData, EVCUT_048, by.x="EVTYPE", by.y="Event_Type", all.x=T, all.y=F)
# clusterData will allow me to show info such as names and counts for a cluster
clusterData <- stormData %>% group_by(EVTYPE, Cluster) %>% summarize(COUNT = n()) %>%
ungroup() %>%
arrange(Cluster)
## `summarise()` has grouped output by 'EVTYPE'. You can override using the `.groups` argument.
summary(clusterData)
## EVTYPE Cluster COUNT
## Length:430 Min. : 1.00 Min. : 1
## Class :character 1st Qu.:11.00 1st Qu.: 1
## Mode :character Median :23.00 Median : 2
## Mean :21.49 Mean : 1520
## 3rd Qu.:30.00 3rd Qu.: 12
## Max. :48.00 Max. :207715
print(paste("There are now ",max(clusterData$Cluster)," Clusters."))
## [1] "There are now 48 Clusters."
# Prepare a summarized version of the Totals for FATALITIES, INJURIES, PROPDMG,
# and CROPDMG for plotting
stormData_summarized <- stormData %>% group_by(Cluster) %>%
summarize(EVTYPE = first(EVTYPE), COUNT = n(), FATALITIES = sum(FATALITIES, na.rm=T),
INJURIES = sum(INJURIES, na.rm=T), PROPDMG = sum(PROPDMG_AMT, na.rm=T),
CROPDMG = sum(CROPDMG_AMT, na.rm=T)) %>% ungroup()
# Reorder the data for the plotting
stormData_Fatalities <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,FATALITIES)) %>%
arrange(desc(FATALITIES))
stormData_Injuries <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,INJURIES)) %>%
arrange(desc(INJURIES))
# Plot the most important data regarding the effects on the Public Health
ggplot(data=stormData_Fatalities[1:10,], aes(EVTYPE, FATALITIES)) +
geom_bar(stat = "identity",aes(fill = Cluster),width=0.7,
position = position_dodge(width = 1.0)) + coord_flip() +
theme_bw() +
labs(x="Event Types", y="Top 10 Fatal Storm Events")+
labs(title="US Fatalities (1996-2011) per Storm Event")
ggplot(data=stormData_Injuries[1:10,], aes(EVTYPE, INJURIES)) +
geom_bar(stat = "identity",aes(fill = Cluster),width=0.7,
position = position_dodge(width = 1.0)) + coord_flip() +
theme_bw() +
labs(x="Event Types", y="Top 10 Injurious Storm Events") +
labs(title="US Injuries (1996-2011) per Storm Event")
stormData_Cropdmg <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,CROPDMG)) %>%
arrange(desc(CROPDMG))
stormData_Propdmg <- transform(stormData_summarized, EVTYPE = reorder(EVTYPE,PROPDMG)) %>%
arrange(desc(PROPDMG))
# Plot results on Property Damage
ggplot(data=stormData_Propdmg[1:10,], aes(EVTYPE, PROPDMG)) +
geom_bar(stat = "identity",aes(fill = Cluster),width=0.7,
position = position_dodge(width = 1.0)) + coord_flip() +
theme_bw() +
labs(x="Event Types", y="Top 10 Property Damage Storm Events") +
labs(title="US Property Damage-1996-2011 from Storm Event")