Synopsis

The goal of this project is to explore data contained in the Storm Event Database from 1950 to 2011 to investigate which types of weather events are most harmful with respect to population health and which have the greatest economic consequences. The project is organised in the following sections:

Data Processing

The National Oceanic and Atmospheric Administration (NOAA) is an US scientific agency that focuses on “conditions of the oceans, major waterways, and atmosphere” which one of the main activity is “Monitoring and observing Earth systems with instruments and data collection networks”. The Storm Event Database contains data from 1950 to date that tracks the caracteristics of major weather events including an estimates of any fatalities, injuries and property damage.

The original raw data file is a zipped ~50Mb file containing an about 500Mb database and so the strategy for saving computer space (and computing time) chosen for analysing the data is to load only the variable that are necessary for addressing the proposed questions.

The following code extracts the first lines from the .csv file to give us information about the variable it contains

destfile <- "./Data/repdata_data_StormData.csv.bz2"
if(!file.exists(destfile))
{
    dir.create("./Data")
    destfile <- "./Data/repdata_data_StormData.csv.bz2"
    urlziplocation <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(urlziplocation, destfile)
    names <- read.csv(destfile, nrows = 1, header = TRUE)
}else{
    names <- read.csv(destfile, nrows = 1, header = TRUE)
}
names(names)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The database contains 37 variables, in order to address the questions, only 9 variables (namely BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) are extracted from the .csv file.

df.rawdata <- read.csv("./Data/repdata_data_StormData.csv.bz2",
                       colClasses = c("NULL", "character", rep("NULL",4), 
                                      rep("character",2), rep("NULL",14),
                                      rep("numeric",3), "character", "numeric", 
                                      "character", rep("NULL",9)))

After loading the raw data, the database is further reduced by considering raws where at least one fatality/injury/property or crop damage have been recorder.

df.wdata <- df.rawdata[!(df.rawdata$FATALITIES == 0 & df.rawdata$INJURIES == 0 &
                           df.rawdata$PROPDMG == 0 & df.rawdata$CROPDMG == 0),]

In this way the original database consisting of 902297 \(\times\) 37 values has been pruned into a more handable 254633 \(\times\) 9 values database.

Creating YEAR variable

A new YEAR variable is added to the database to evaluate in which year starting to group the data as not only certain events have been recorded from the beginning of the database creation.

df.wdata$YEAR <- year(as.Date(sub(" .*",  "", df.wdata$BGN_DATE), format("%m/%d/%Y")))

Computing Crop and Properties economic costs

The agriculturals and properties damage costs are stored in the CROPDMG and PROPDMG variables as numerical value while in PROPDMGEXP and CROPDMGEXP there is their exponent. The following function the damage value by multiplying the registred costs for the recorded exponent.

convert.exp <- function(x)
{
    if(x == "B")
        return(1000000000)
    else if(x == "M" | x == "m")
        return(1000000)
    else if(x == "K" | x == "k")
        return(1000)
    else if(x == "H" | x == "h")
        return(100)
    return(1)
}

Since I could not find the documentation about some exponent factor, I decided to set the unknown characters to 1. The function can be easily changed if further information is added. The new variables PROPDMGCONV and CROPDMGCONV contains the converted cost values.

df.wdata$PROPDMGCONV <- sapply(df.wdata$PROPDMGEXP, convert.exp) * df.wdata$PROPDMG

df.wdata$CROPDMGCONV <- sapply(df.wdata$CROPDMGEXP, convert.exp) * df.wdata$CROPDMG 

Selecting and merging Event Type

The National Weather Service Storm Data Documentation document lists \(48\) different event type (ref. section 2.1.1 for the Storm Data Event Table and Chapter 7 for event type description) while EVTYPE database variable contains 488 different values. The following code reduce the number of unique Event Types (the full code is in Appendix and future works.

df.wdata[grep("Marine hail",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Hail"
df.wdata[grep("Marine High Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine High Wind"
df.wdata[grep("Marine Strong Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Strong Wind"
df.wdata[grep("Marine Tstm|Marine Thunde",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Thunderstorm Wind"
df.wdata[grep("Lightn|lighting|LIGNTNING|Lightning",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Lightning"
df.wdata[grep("Thunderstorm  winds|Tstm w|tstmw|Thunderstormwinds|Thunderstorms w|
              Thunderstormw|Thunderstom winds|Tunderstom winds",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Thunderstorm Wind"
df.wdata[grep("THUND|THUNE",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Thunderstorm"

Since the documentation is not perfectly clear and the renaming process is quite tedious, the proposed solution is far to be optimal but gives an idea on how to takle this issue. In this way, the number of unique Event Type has been reduced to 102.

Health and Economic Costs analysis

The Storm Event Database webpage states that all different events started to be registred from 1996 onwards. For this reason, in this project only events happened after 1995 are considered.

Most harmful events

The following code filter the database by grouping the data by Event Type and then calculating the sum of fatalities and injuries

df.casualties <- df.wdata[df.wdata$YEAR > 1995,] %>% 
    group_by(EVTYPE) %>%
    summarise(sum_fatalities = sum(FATALITIES), 
              sum_injuries = sum(INJURIES)) %>%
    mutate(total_sum = sum_fatalities + sum_injuries)

From this database, the top 10 most harmful events, with respect to fatalities, injuries and their sum, are extracted.

df.top10sumfatalities <- 
    as.data.frame(df.casualties[order(df.casualties$sum_fatalities, 
                                                  decreasing = TRUE)[1:10],])
df.top10suminjuries <- 
    as.data.frame(df.casualties[order(df.casualties$sum_injuries, 
                                                  decreasing = TRUE)[1:10],])
df.top10total <- 
    as.data.frame(df.casualties[order(df.casualties$total_sum, 
                                      decreasing = TRUE)[1:10],])

Economic consequences

A similar approach has been used for the economic consequences. First grouping by event type and calculating their sum.

df.cost <- df.wdata[df.wdata$YEAR > 1995,] %>% 
    group_by(EVTYPE) %>%
    summarise(cropdmg = sum(CROPDMGCONV), 
              propdmg = sum(PROPDMGCONV)) %>%
    mutate(total_sum = cropdmg + propdmg)

And then extracting the top 10 events that have a major impact on agriculture (crop) and properties.

df.top10PROPcost <- as.data.frame(df.cost[order(df.cost$propdmg, 
                                            decreasing = TRUE)[1:10], ])
df.top10CROPcost <- as.data.frame(df.cost[order(df.cost$cropdmg, 
                                            decreasing = TRUE)[1:10], ])
df.top10cost <- as.data.frame(df.cost[order(df.cost$total_sum, 
                                                decreasing = TRUE)[1:10], ])

Results

Most Harmful Events

As for the most harmful events, the following plot shows the top 10 Events that caused more fatalities, injuries and the sum of both of them.

g1 <- ggplot(df.top10sumfatalities, aes(reorder(EVTYPE, -sum_fatalities), sum_fatalities)) +
    geom_bar(stat = "identity", position = "dodge", fill = "salmon") +
    labs(title = "Sum Fatalities", x = "Event Type", y = "Number of Casualties") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

g2 <- ggplot(df.top10suminjuries,aes(reorder(EVTYPE, -sum_injuries), sum_injuries)) +
    geom_bar(stat = "identity", position = "dodge", fill = "springgreen3") +
    labs(title = "Sum Injuries", x = "Event Type", y = "Number of Casualties") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

df.top10totalplot <- melt(df.top10total[,c("EVTYPE", "sum_fatalities", "sum_injuries", "total_sum")], id.vars = 1)

g3 <- ggplot(df.top10totalplot, aes(reorder(EVTYPE, -value), value)) +
    geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
    labs(title = "Fatalities + Injuries", x = "Event Type", y = "Number of Casualties") +
    scale_fill_discrete(labels = c("Fatalities", "Injuries", "Fatalities + Injuries")) + 
    theme(legend.title = element_blank(), 
          legend.background = element_rect(fill="transparent"),
          legend.position=c(.75,.85), 
          legend.key.height = unit(0.2, "cm"),legend.key.width = unit(0.2, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1))

gridExtra::grid.arrange(g1, g2, g3, nrow = 1,
                        bottom=textGrob("Figure 1: 10 ten most harmful events", gp=gpar(fontsize=11)))

The plot shows that, as expected, fatalities are much more less than injuries (a 10 factor difference as depicted in the Fatalitie + Injuries plot) and while Heat is the weather event causing more fatalities (followed by Tornado, Flood and Lightnign), Tornado are, by far, the major cause of injuries in US (with Flood, Heat and Lighting other major causes).

The following table report the different casualties for event type.

Event Type Fatalities Event Type Injuries Event Type Total
Heat 2036 Tornado 20667 Tornado 22178
Tornado 1511 Flood 8520 Flood 9857
Flood 1337 Heat 7683 Heat 9719
Lightning 651 Lightning 4141 Lightning 4792
Rip Current 542 Thunderstorm Wind 3729 Thunderstorm Wind 3976
Strong Wind 383 Strong Wind 1507 Strong Wind 1890
Avalanche 266 Wildfire 1458 Wildfire 1545
Thunderstorm Wind 247 Thunderstorm 1401 Thunderstorm 1533
Cold Wind Chill 237 Hurricane Typhoon 1328 Winterstorm 1488
Winterstorm 195 Winterstorm 1293 Hurricane Typhoon 1453

Most costly events

In the case of most costly events, we can directly aggregate agricultural and property damages since they are now comparable in terms of costs.

g1 <- ggplot(df.top10CROPcost, aes(reorder(EVTYPE, -cropdmg), cropdmg)) +
    geom_bar(stat = "identity", position = "dodge", fill = "salmon") +
    labs(title = "Crop Damages",x = "Event Type",y = "Cost in US $") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

g2 <- ggplot(df.top10PROPcost, aes(reorder(EVTYPE, -propdmg), propdmg)) +
    geom_bar(stat = "identity", position = "dodge", fill = "springgreen3") +
    labs(title = "Property Damages",x = "Event Type",y = "Cost in US $") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

df.top10costplot <- melt(df.top10cost[,c("EVTYPE", "cropdmg", "propdmg", "total_sum")], id.vars = 1)

g3 <- ggplot(df.top10costplot, aes(reorder(EVTYPE, -value), value)) +
    geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
    labs(title = "Property + Crop damages", x = "Event Type", y = "Cost in US $") +
    scale_fill_discrete(labels = c("Crop", "Property", "Sum")) + 
    theme(legend.background = element_rect(fill="transparent", ),
          legend.position=c(.75,.85), 
          legend.key.height = unit(0.2, "cm"), legend.key.width = unit(0.2, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1))

gridExtra::grid.arrange(g1, g2, g3, nrow = 1, 
                        bottom=textGrob("Figure 2: 10 ten most economically costly events",gp=gpar(fontsize=11)))

As we can see from the plot, properties damages are much higer than crop ones (approximately the 90% of registred costs comes from property damages). And if Drought is the most economically effective cost for agriculture (followed by Flood and Hurricanes), Flood is the major cause for property damages (followed by Hurricanes and Storm Surge).

The following table summarise the total costs (in US $).

Event Type Crop damages Event Type Property damages Event Type Total Cost
Drought 13367566000 Flood 159765842670 Flood 166113905870
Flood 6348063200 Hurricane Typhoon 81718889010 Hurricane Typhoon 87068996810
Hurricane Typhoon 5350107800 Storm Surge 47834724000 Storm Surge 47835579000
Frost Freeze 2682776500 Tornado 24616945710 Tornado 24900370720
Funnel Cloud 2496822450 Funnel Cloud 14595347520 Funnel Cloud 17092169970
Heavy Rain 729669800 Wildfire 7760449500 Drought 14413667000
Strong Wind 699024800 Tropical Storm 7642475550 Tropical Storm 8320186550
Tropical Storm 677711000 Strong Wind 5430792430 Wildfire 8162704630
Thunderstorm Wind 618611600 Thunderstorm Wind 4530861440 Strong Wind 6129817230
Heat 492578500 Ice 3642260810 Thunderstorm Wind 5149473040

Appendix and future works

This analysis is fully available on my git repository: https://github.com/pacoraggio/ReproducibleResearchPA2 (copy the link if Rpubs doesn’t allow you to open a new window) as a public project. After the course evaluation, I would like to perform time and geographical analyses on the data to see if local authorities have worked to prevent further damage. Please feel free to consult the repository and contact me if you have any suggestion (contact details on GitHub).

The following is the complete code for renaming and grouping the different event types. As part of the future works I will redefine the way to group the event types.

df.wdata[grep("Marine hail",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Hail"
df.wdata[grep("Marine High Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine High Wind"
df.wdata[grep("Marine Strong Wind",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Strong Wind"
df.wdata[grep("Marine Tstm|Marine Thunde",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Thunderstorm Wind"
df.wdata[grep("Lightn|lighting|LIGNTNING|Lightning",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Lightning"
df.wdata[grep("Thunderstorm  winds|Tstm w|tstmw|Thunderstormwinds|Thunderstorms w|Thunderstormw|Thunderstom winds|Tunderstom winds",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Thunderstorm Wind"
df.wdata[grep("THUND|THUNE",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Thunderstorm"
df.wdata[grep("Astronomical L",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Astronomical Low Tide"
df.wdata[grep("Avalanc|Slid",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Avalanche"
df.wdata[grep("Bliz",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Blizzard"
df.wdata[grep("Flood|Fld",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Flood"
df.wdata[grep("Chill",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Cold Wind Chill"
df.wdata[grep("Freezing Fog",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Freezing Fog"
df.wdata[grep("FOG",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Dense Fog"
df.wdata[grep("Dense Smoke",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Dense Smoke"
df.wdata[grep("Droug",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Drought"
df.wdata[grep("Devil",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Dust Devil"
df.wdata[grep("DUST",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Dust Storm"
df.wdata[grep("Heat",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Heat"
df.wdata[grep("Frost",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Frost"
df.wdata[grep("Funnel",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Funnel Cloud"
df.wdata[grep("Heavy Rain|hvy rain|Rainfall|Rainstorm",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Heavy Rain"
df.wdata[grep("snow",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Snow"
df.wdata[grep("Surf",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Surf"
df.wdata[grep("Hurr|Typh",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Hurricane Typhoon"
df.wdata[grep("Sleet",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Sleet"
df.wdata[grep("Ice",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Ice"
df.wdata[grep("Curre",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Rip Current"
df.wdata[grep("Seiche",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Seiche"
df.wdata[grep("Storm Surge",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Storm Surge"
df.wdata[grep("WIND",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Strong Wind"
df.wdata[grep("Strong Wind|Strong Winds|Whirlwind|Gusty Winds|Gusty wind|Wind Damage|Gradient wind|gradient wind",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Strong Wind"
df.wdata[grep("Torn",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tornado"
df.wdata[grep("Tropical dep",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tropical Depression"
df.wdata[grep("Tropical sto",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tropical Storm"
df.wdata[grep("Tsu",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tsunami"
df.wdata[grep("Volcanic",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Volcanic Ash"
df.wdata[grep("Waterspou",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Waterspout"
df.wdata[grep("fire",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Wildfire"
df.wdata[grep("STORM",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Winterstorm"
df.wdata[grep("WINTER",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Winter Weather"
df.wdata[grep("HAIL",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Funnel Cloud"
df.wdata[grep("COLD|HYPO",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Frost Freeze"
df.wdata[grep("Freeze|Frost",
              df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Frost Freeze"
df.wdata[grep("Frost Freeze|FREEZING RAIN|FREEZING DRIZZLE|Freezing Spray|Freezing Drizzle|Freezing Rain|Freezing drizzle|LIGHT FREEZING RAIN",
              df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Frost Freeze"