Synopsis

The goal of this project is to explore data contained in the Storm Event Database from 1950 to 2011 to investigate which types of weather events are most harmful with respect to population health and which have the greatest economic consequences. The project is organised in the following sections:

Data Processing: explains how the database has been downloaded, explored and a selected number of variables has been loaded into a dataframe. The dataframe has been further processed in order to compute the economic impact of the different events and to filter the event type describing the cause of casualties and damages.
Health and Economic Costs analysis: explain how analysis has been conducted in order to detect the most harmful and most costly event types.
Results presents plots and tables summarising the findings.
Appendix and future works contains some further code used for the current project and shows futher analysis that could be of some interest.

Data Processing

The National Oceanic and Atmospheric Administration (NOAA) is an US scientific agency that focuses on “conditions of the oceans, major waterways, and atmosphere” which one of the main activity is “Monitoring and observing Earth systems with instruments and data collection networks”. The Storm Event Database contains data from 1950 to date that tracks the caracteristics of major weather events including an estimates of any fatalities, injuries and property damage.

The original raw data file is a zipped ~50Mb file containing an about 500Mb database and so the strategy for saving computer space (and computing time) chosen for analysing the data is to load only the variable that are necessary for addressing the proposed questions.

The following code extracts the first lines from the .csv file to give us information about the variable it contains

destfile <- "./Data/repdata_data_StormData.csv.bz2"
if(!file.exists(destfile))
{
    dir.create("./Data")
    destfile <- "./Data/repdata_data_StormData.csv.bz2"
    urlziplocation <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(urlziplocation, destfile)
    names <- read.csv(destfile, nrows = 1, header = TRUE)
}else{
    names <- read.csv(destfile, nrows = 1, header = TRUE)
}
names(names)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The database contains 37 variables, in order to address the questions, only 9 variables (namely BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) are extracted from the .csv file.

df.rawdata <- read.csv("./Data/repdata_data_StormData.csv.bz2",
                       colClasses = c("NULL", "character", rep("NULL",4), 
                                      rep("character",2), rep("NULL",14),
                                      rep("numeric",3), "character", "numeric", 
                                      "character", rep("NULL",9)))

After loading the raw data, the database is further reduced by considering raws where at least one fatality/injury/property or crop damage have been recorder.

df.wdata <- df.rawdata[!(df.rawdata$FATALITIES == 0 & df.rawdata$INJURIES == 0 &
                           df.rawdata$PROPDMG == 0 & df.rawdata$CROPDMG == 0),]

In this way the original database consisting of 902297 $\times$ 37 values has been pruned into a more handable 254633 $\times$ 9 values database.

Creating YEAR variable

A new YEAR variable is added to the database to evaluate in which year starting to group the data as not only certain events have been recorded from the beginning of the database creation.

df.wdata$YEAR <- year(as.Date(sub(" .*",  "", df.wdata$BGN_DATE), format("%m/%d/%Y")))

Computing Crop and Properties economic costs

The agriculturals and properties damage costs are stored in the CROPDMG and PROPDMG variables as numerical value while in PROPDMGEXP and CROPDMGEXP there is their exponent. The following function the damage value by multiplying the registred costs for the recorded exponent.

convert.exp <- function(x)
{
    if(x == "B")
        return(1000000000)
    else if(x == "M" | x == "m")
        return(1000000)
    else if(x == "K" | x == "k")
        return(1000)
    else if(x == "H" | x == "h")
        return(100)
    return(1)
}

Since I could not find the documentation about some exponent factor, I decided to set the unknown characters to 1. The function can be easily changed if further information is added. The new variables PROPDMGCONV and CROPDMGCONV contains the converted cost values.

df.wdata$PROPDMGCONV <- sapply(df.wdata$PROPDMGEXP, convert.exp) * df.wdata$PROPDMG

df.wdata$CROPDMGCONV <- sapply(df.wdata$CROPDMGEXP, convert.exp) * df.wdata$CROPDMG

Selecting and merging Event Type

The National Weather Service Storm Data Documentation document lists $48$ different event type (ref. section 2.1.1 for the Storm Data Event Table and Chapter 7 for event type description) while EVTYPE database variable contains 488 different values. The following code reduce the number of unique Event Types (the full code is in Appendix and future works.

df.wdata[grep("Marine hail",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Hail"
df.wdata[grep("Marine High Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine High Wind"
df.wdata[grep("Marine Strong Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Strong Wind"
df.wdata[grep("Marine Tstm|Marine Thunde",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Thunderstorm Wind"
df.wdata[grep("Lightn|lighting|LIGNTNING|Lightning",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Lightning"
df.wdata[grep("Thunderstorm  winds|Tstm w|tstmw|Thunderstormwinds|Thunderstorms w|
              Thunderstormw|Thunderstom winds|Tunderstom winds",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Thunderstorm Wind"
df.wdata[grep("THUND|THUNE",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Thunderstorm"

Since the documentation is not perfectly clear and the renaming process is quite tedious, the proposed solution is far to be optimal but gives an idea on how to takle this issue. In this way, the number of unique Event Type has been reduced to 102.

Health and Economic Costs analysis

The Storm Event Database webpage states that all different events started to be registred from 1996 onwards. For this reason, in this project only events happened after 1995 are considered.

Most harmful events

The following code filter the database by grouping the data by Event Type and then calculating the sum of fatalities and injuries

df.casualties <- df.wdata[df.wdata$YEAR > 1995,] %>% 
    group_by(EVTYPE) %>%
    summarise(sum_fatalities = sum(FATALITIES), 
              sum_injuries = sum(INJURIES)) %>%
    mutate(total_sum = sum_fatalities + sum_injuries)

From this database, the top 10 most harmful events, with respect to fatalities, injuries and their sum, are extracted.

df.top10sumfatalities <- 
    as.data.frame(df.casualties[order(df.casualties$sum_fatalities, 
                                                  decreasing = TRUE)[1:10],])
df.top10suminjuries <- 
    as.data.frame(df.casualties[order(df.casualties$sum_injuries, 
                                                  decreasing = TRUE)[1:10],])
df.top10total <- 
    as.data.frame(df.casualties[order(df.casualties$total_sum, 
                                      decreasing = TRUE)[1:10],])

Economic consequences

A similar approach has been used for the economic consequences. First grouping by event type and calculating their sum.

df.cost <- df.wdata[df.wdata$YEAR > 1995,] %>% 
    group_by(EVTYPE) %>%
    summarise(cropdmg = sum(CROPDMGCONV), 
              propdmg = sum(PROPDMGCONV)) %>%
    mutate(total_sum = cropdmg + propdmg)

And then extracting the top 10 events that have a major impact on agriculture (crop) and properties.

df.top10PROPcost <- as.data.frame(df.cost[order(df.cost$propdmg, 
                                            decreasing = TRUE)[1:10], ])
df.top10CROPcost <- as.data.frame(df.cost[order(df.cost$cropdmg, 
                                            decreasing = TRUE)[1:10], ])
df.top10cost <- as.data.frame(df.cost[order(df.cost$total_sum, 
                                                decreasing = TRUE)[1:10], ])

Results

Most Harmful Events

As for the most harmful events, the following plot shows the top 10 Events that caused more fatalities, injuries and the sum of both of them.

g1 <- ggplot(df.top10sumfatalities, aes(reorder(EVTYPE, -sum_fatalities), sum_fatalities)) +
    geom_bar(stat = "identity", position = "dodge", fill = "salmon") +
    labs(title = "Sum Fatalities", x = "Event Type", y = "Number of Casualties") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

g2 <- ggplot(df.top10suminjuries,aes(reorder(EVTYPE, -sum_injuries), sum_injuries)) +
    geom_bar(stat = "identity", position = "dodge", fill = "springgreen3") +
    labs(title = "Sum Injuries", x = "Event Type", y = "Number of Casualties") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

df.top10totalplot <- melt(df.top10total[,c("EVTYPE", "sum_fatalities", "sum_injuries", "total_sum")], id.vars = 1)

g3 <- ggplot(df.top10totalplot, aes(reorder(EVTYPE, -value), value)) +
    geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
    labs(title = "Fatalities + Injuries", x = "Event Type", y = "Number of Casualties") +
    scale_fill_discrete(labels = c("Fatalities", "Injuries", "Fatalities + Injuries")) + 
    theme(legend.title = element_blank(), 
          legend.background = element_rect(fill="transparent"),
          legend.position=c(.75,.85), 
          legend.key.height = unit(0.2, "cm"),legend.key.width = unit(0.2, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1))

gridExtra::grid.arrange(g1, g2, g3, nrow = 1,
                        bottom=textGrob("Figure 1: 10 ten most harmful events", gp=gpar(fontsize=11)))

The plot shows that, as expected, fatalities are much more less than injuries (a 10 factor difference as depicted in the Fatalitie + Injuries plot) and while Heat is the weather event causing more fatalities (followed by Tornado, Flood and Lightnign), Tornado are, by far, the major cause of injuries in US (with Flood, Heat and Lighting other major causes).

The following table report the different casualties for event type.

Event Type	Fatalities	Event Type	Injuries	Event Type	Total
Heat	2036	Tornado	20667	Tornado	22178
Tornado	1511	Flood	8520	Flood	9857
Flood	1337	Heat	7683	Heat	9719
Lightning	651	Lightning	4141	Lightning	4792
Rip Current	542	Thunderstorm Wind	3729	Thunderstorm Wind	3976
Strong Wind	383	Strong Wind	1507	Strong Wind	1890
Avalanche	266	Wildfire	1458	Wildfire	1545
Thunderstorm Wind	247	Thunderstorm	1401	Thunderstorm	1533
Cold Wind Chill	237	Hurricane Typhoon	1328	Winterstorm	1488
Winterstorm	195	Winterstorm	1293	Hurricane Typhoon	1453

Most costly events

In the case of most costly events, we can directly aggregate agricultural and property damages since they are now comparable in terms of costs.

g1 <- ggplot(df.top10CROPcost, aes(reorder(EVTYPE, -cropdmg), cropdmg)) +
    geom_bar(stat = "identity", position = "dodge", fill = "salmon") +
    labs(title = "Crop Damages",x = "Event Type",y = "Cost in US $") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

g2 <- ggplot(df.top10PROPcost, aes(reorder(EVTYPE, -propdmg), propdmg)) +
    geom_bar(stat = "identity", position = "dodge", fill = "springgreen3") +
    labs(title = "Property Damages",x = "Event Type",y = "Cost in US $") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

df.top10costplot <- melt(df.top10cost[,c("EVTYPE", "cropdmg", "propdmg", "total_sum")], id.vars = 1)

g3 <- ggplot(df.top10costplot, aes(reorder(EVTYPE, -value), value)) +
    geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
    labs(title = "Property + Crop damages", x = "Event Type", y = "Cost in US $") +
    scale_fill_discrete(labels = c("Crop", "Property", "Sum")) + 
    theme(legend.background = element_rect(fill="transparent", ),
          legend.position=c(.75,.85), 
          legend.key.height = unit(0.2, "cm"), legend.key.width = unit(0.2, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1))

gridExtra::grid.arrange(g1, g2, g3, nrow = 1, 
                        bottom=textGrob("Figure 2: 10 ten most economically costly events",gp=gpar(fontsize=11)))

As we can see from the plot, properties damages are much higer than crop ones (approximately the 90% of registred costs comes from property damages). And if Drought is the most economically effective cost for agriculture (followed by Flood and Hurricanes), Flood is the major cause for property damages (followed by Hurricanes and Storm Surge).

The following table summarise the total costs (in US $).

Event Type	Crop damages	Event Type	Property damages	Event Type	Total Cost
Drought	13367566000	Flood	159765842670	Flood	166113905870
Flood	6348063200	Hurricane Typhoon	81718889010	Hurricane Typhoon	87068996810
Hurricane Typhoon	5350107800	Storm Surge	47834724000	Storm Surge	47835579000
Frost Freeze	2682776500	Tornado	24616945710	Tornado	24900370720
Funnel Cloud	2496822450	Funnel Cloud	14595347520	Funnel Cloud	17092169970
Heavy Rain	729669800	Wildfire	7760449500	Drought	14413667000
Strong Wind	699024800	Tropical Storm	7642475550	Tropical Storm	8320186550
Tropical Storm	677711000	Strong Wind	5430792430	Wildfire	8162704630
Thunderstorm Wind	618611600	Thunderstorm Wind	4530861440	Strong Wind	6129817230
Heat	492578500	Ice	3642260810	Thunderstorm Wind	5149473040

Appendix and future works

This analysis is fully available on my git repository: https://github.com/pacoraggio/ReproducibleResearchPA2 (copy the link if Rpubs doesn’t allow you to open a new window) as a public project. After the course evaluation, I would like to perform time and geographical analyses on the data to see if local authorities have worked to prevent further damage. Please feel free to consult the repository and contact me if you have any suggestion (contact details on GitHub).

The following is the complete code for renaming and grouping the different event types. As part of the future works I will redefine the way to group the event types.

df.wdata[grep("Marine hail",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Hail"
df.wdata[grep("Marine High Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine High Wind"
df.wdata[grep("Marine Strong Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Strong Wind"
df.wdata[grep("Marine Tstm|Marine Thunde",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Thunderstorm Wind"
df.wdata[grep("Lightn|lighting|LIGNTNING|Lightning",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Lightning"
df.wdata[grep("Thunderstorm  winds|Tstm w|tstmw|Thunderstormwinds|Thunderstorms w|Thunderstormw|Thunderstom winds|Tunderstom winds",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Thunderstorm Wind"
df.wdata[grep("THUND|THUNE",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Thunderstorm"
df.wdata[grep("Astronomical L",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Astronomical Low Tide"
df.wdata[grep("Avalanc|Slid",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Avalanche"
df.wdata[grep("Bliz",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Blizzard"
df.wdata[grep("Flood|Fld",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Flood"
df.wdata[grep("Chill",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Cold Wind Chill"
df.wdata[grep("Freezing Fog",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Freezing Fog"
df.wdata[grep("FOG",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Dense Fog"
df.wdata[grep("Dense Smoke",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Dense Smoke"
df.wdata[grep("Droug",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Drought"
df.wdata[grep("Devil",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Dust Devil"
df.wdata[grep("DUST",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Dust Storm"
df.wdata[grep("Heat",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Heat"
df.wdata[grep("Frost",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Frost"
df.wdata[grep("Funnel",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Funnel Cloud"
df.wdata[grep("Heavy Rain|hvy rain|Rainfall|Rainstorm",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Heavy Rain"
df.wdata[grep("snow",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Snow"
df.wdata[grep("Surf",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Surf"
df.wdata[grep("Hurr|Typh",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Hurricane Typhoon"
df.wdata[grep("Sleet",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Sleet"
df.wdata[grep("Ice",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Ice"
df.wdata[grep("Curre",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Rip Current"
df.wdata[grep("Seiche",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Seiche"
df.wdata[grep("Storm Surge",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Storm Surge"
df.wdata[grep("WIND",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Strong Wind"
df.wdata[grep("Strong Wind|Strong Winds|Whirlwind|Gusty Winds|Gusty wind|Wind Damage|Gradient wind|gradient wind",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Strong Wind"
df.wdata[grep("Torn",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tornado"
df.wdata[grep("Tropical dep",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tropical Depression"
df.wdata[grep("Tropical sto",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tropical Storm"
df.wdata[grep("Tsu",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tsunami"
df.wdata[grep("Volcanic",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Volcanic Ash"
df.wdata[grep("Waterspou",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Waterspout"
df.wdata[grep("fire",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Wildfire"
df.wdata[grep("STORM",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Winterstorm"
df.wdata[grep("WINTER",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Winter Weather"
df.wdata[grep("HAIL",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Funnel Cloud"
df.wdata[grep("COLD|HYPO",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Frost Freeze"
df.wdata[grep("Freeze|Frost",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Frost Freeze"
df.wdata[grep("Frost Freeze|FREEZING RAIN|FREEZING DRIZZLE|Freezing Spray|Freezing Drizzle|Freezing Rain|Freezing drizzle|LIGHT FREEZING RAIN",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Frost Freeze"

An analysis of the Most Harmful and Destructive Weather Events in US – From 1996 to 2011

Paolo Coraggio

16/01/2020