Synopsis

Based on the data from the ‘NOAA Storm Database’ (from 1950 to November 2011), this document aims to answer two questions related to severe weather events. First, we’ll try to know what is the most harmful types of events, accross the United States, with respect to population health. Then, we’ll find out what types of events, still accross the US, have the greatest economic consequences.

Data Processing

The data from the ‘NOAA Storm Database’ are provided into a ‘.bz2’ file that first needs to be unzip before being loaded and processed for further analysis. The code below shows the way to unzip the file ‘repdata-data-StormData.csv.bz2’, and load the data into a dataframe object ‘repdata’.
Out of the 37 variables provided, we also focus on those that will be relevant for our analysis:
‘EVTYPE’, ‘BGN_DATE’, ‘FATALITIES’, ‘INJURIES’, ‘PROPDMG’, ‘PROPDMGEXP’, ‘CROPDMG’ and ‘CROPDMGEXP’

# Load the data from file 'repdata-data-StormData.csv.bz2' and store them into
# dataframe 'repdata'. We use the option 'cache = TRUE' into this chunk in order
# not to reload the csv file at any time we execute the compilation of this
# markdown file.
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y %h:%m:%s") )
repdata <- read.csv(bzfile('repdata-data-StormData.csv.bz2'),
                   header = TRUE,
                   sep = ',',
                   colClasses = col_classes)
repdata <- repdata[ , c('EVTYPE',
                        'BGN_DATE',
                        'FATALITIES',
                        'INJURIES',
                        'PROPDMG',
                        'PROPDMGEXP',
                        'CROPDMG',
                        'CROPDMGEXP')]

The data then need to be reprocessed before being analysed

  1. Values in column ‘EVTYPE’ (events types) are to be clean and set to uppercase, in order to be grouped later on
# Change EVTYPE content to upper case and apply strip (left and right)
# White spaces could have been removed while reading the file using parameter
# 'strip.white=TRUE'
repdata$EVTYPE <- sapply(repdata$EVTYPE,
                         FUN = function (x) toupper(str_trim(x)))
  1. We noticed that values in column ‘EVTYPE’ of the dataset could not be used as they are, due to mistakes in spelling (eg. ‘TORNDAO’ instead of ‘TORNADO’), or sub-categories that are not relevant for our analysis (eg. ‘Hail 0.75 inch’ and ‘Hail 0.88 inch’ shall both be considered as ‘Hail’). Therefore, we will add a column ‘EVTYPE2’ and we will define a correspondence table to map one to each other.
# Define a correspondence table to map new column 'EVTYPE2' with values in
# column 'EVTYPE'
evtypes_dictionary <- list('TORNADO' = 'TORNADO',
                           'TORNADO' = 'TORNDAO',
                           'TORNADO' = 'LANDSPOUT',
                           'AVALANCHE' = 'AVALANC',
                           'BLIZZARD' = 'BLIZZARD',
                           'BLIZZARD' = 'WINTER STORM',
                           'COASTAL FLOODING/EROSION' = 'BEACH EROSI',
                           'COASTAL FLOODING/EROSION' = 'BEACH FLOOD',
                           'COASTAL FLOODING/EROSION' = 'FLOODING/EROSION',
                           'COASTAL FLOODING/EROSION' = 'COASTAL EROSION',
                           'COASTAL FLOODING/EROSION' = 'COASTAL FLOOD',
                           'COASTAL FLOODING/EROSION' = 'CSTL FLOOD',
                           'COASTAL FLOODING/EROSION' = 'COASTAL/TIDAL FLOOD',
                           'COASTAL FLOODING/EROSION' = 'COASTALFLOOD',
                           'COASTAL FLOODING/EROSION' = 'COASTAL SURGE',
                           'HEAVY SEA/TIDE/CURRENT' = 'CURRENT',
                           'HEAVY SEA/TIDE/CURRENT' = 'HIGH SEA',
                           'HEAVY SEA/TIDE/CURRENT' = 'HEAVY SEA',
                           'HEAVY SEA/TIDE/CURRENT' = 'ROUGH SEAS',
                           'HEAVY SEA/TIDE/CURRENT' = 'HIGH WAVES',
                           'HEAVY SEA/TIDE/CURRENT' = 'HEAT WAVE',
                           'HEAVY SEA/TIDE/CURRENT' = 'HIGH WATER',
                           'HEAVY SEA/TIDE/CURRENT' = 'HIGH TIDE',
                           'HEAVY SEA/TIDE/CURRENT' = 'LOW TIDE',
                           'HEAVY SEA/TIDE/CURRENT' = 'BLOW-OUT TIDE',
                           'HEAVY SEA/TIDE/CURRENT' = 'ROUGH SURF',
                           'HEAVY SEA/TIDE/CURRENT' = 'HEAVY SURF',
                           'HEAVY SEA/TIDE/CURRENT' = 'HIGH SURF',
                           'HEAVY SEA/TIDE/CURRENT' = 'SWELL',
                           'HEAVY SEA/TIDE/CURRENT' = 'HAZARDOUS SURF',
                           'HEAVY SEA/TIDE/CURRENT' = 'ROGUE WAVE',
                           'FUNNEL CLOUD' = 'FUNNEL',
                           'GLAZE' = 'GLAZE',
                           'GUSTNADO' = 'GUSTNADO',
                           'WATERSPOUT' = 'WATERSPOUT',
                           'WATERSPOUT' = 'WATER SPOUT',
                           'WATERSPOUT' = 'WAYTERSPOUT',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDERSTORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUDERSTORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDEERSTORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDERESTORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDERSTROM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDERTORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDERTSORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNDESTORM',
                           'THUNDERSTORM/LIGHTNING' = 'THUNERSTORM',
                           'THUNDERSTORM/LIGHTNING' = 'TUNDERSTORM',
                           'THUNDERSTORM/LIGHTNING' = 'LIGHTNING',
                           'THUNDERSTORM/LIGHTNING' = 'LIGHTING',
                           'THUNDERSTORM/LIGHTNING' = 'LIGNTNING',
                           'TROPICAL STORM' = 'TROPICAL ',
                           'TROPICAL STORM' = 'TSTM',
                           'TYPHOON' = 'TYPHOON',
                           'HURRICANE' = 'HURRICANE',
                           'WALL CLOUD' = 'WALL CLOUD',
                           'COASTAL STORM' = 'COASTAL STORM',
                           'COASTAL STORM' = 'COASTALSTORM',
                           'OTHER STORMS' = 'STORM ',
                           'DAM BREAK' = 'DAM BREAK',
                           'TSUNAMI' = 'TSUNAMI',
                           'MICROBURST' = 'MICROBURST',
                           'SMOKE' = 'SMOKE',
                           'HAIL' = 'HAIL',
                           'DUST STORM' = 'DUST',
                           'VOLCANIC ERUPTION' = 'VOLCANIC ',
                           'WILD FIRE' = 'WILD FIRE',
                           'WILD FIRE' = 'WILD/FOREST',
                           'WILD FIRE' = 'WILDFIRE',
                           'WILD FIRE' = 'FOREST FIRE',
                           'WILD FIRE' = 'GRASS FIRE',
                           'WILD FIRE' = 'BRUSH FIRE',
                           'WILD FIRE' = 'RED FLAG FIRE',
                           'WIND' = 'WIND',
                           'WIND' = 'WND',
                           'WIND' = 'GUSTY',
                           'WIND' = 'DOWNBURST',
                           'FLOODING/FLASH FLOODING' = 'FLOOD',
                           'FLOODING/FLASH FLOODING' = 'FLOOOD',
                           'FLOODING/FLASH FLOODING' = 'SMALL STREAM',
                           'FLOODING/FLASH FLOODING' = 'SML STREAM',
                           'FLOODING/FLASH FLOODING' = 'URBAN AND SMALL',
                           'FLOODING/FLASH FLOODING' = 'URBAN SMALL',
                           'FLOODING/FLASH FLOODING' = 'URBAN/SMALL',
                           'FLOODING/FLASH FLOODING' = 'URBAN/SML',
                           'FLOODING/FLASH FLOODING' = 'URBAN/STREET',
                           'FLOODING/FLASH FLOODING' = 'RAPIDLY RISING WATER',
                           'COLD/SNOW/WINTER WEATHER' = 'SNOWFALL',
                           'COLD/SNOW/WINTER WEATHER' = 'BLOWING SNO',
                           'COLD/SNOW/WINTER WEATHER' = 'FROST',
                           'COLD/SNOW/WINTER WEATHER' = 'COLD',
                           'COLD/SNOW/WINTER WEATHER' = 'FREEZ',
                           'COLD/SNOW/WINTER WEATHER' = 'UNSEASONABLY COOL',
                           'COLD/SNOW/WINTER WEATHER' = 'LOW TEMP',
                           'COLD/SNOW/WINTER WEATHER' = 'RECORD LOW',
                           'COLD/SNOW/WINTER WEATHER' = 'RECORD COOL',
                           'COLD/SNOW/WINTER WEATHER' = 'HYPOTHERMIA',
                           'COLD/SNOW/WINTER WEATHER' = 'SNOW',
                           'COLD/SNOW/WINTER WEATHER' = 'SLEET',
                           'COLD/SNOW/WINTER WEATHER' = 'WINTER WEATHER',
                           'COLD/SNOW/WINTER WEATHER' = 'WINTERY',
                           'COLD/SNOW/WINTER WEATHER' = 'WINTRY',
                           'COLD/SNOW/WINTER WEATHER' = 'WINTER MIX',
                           'DROUGHT' = 'DRY',
                           'DROUGHT' = 'DROUGHT',
                           'DROUGHT' = 'DRIEST',
                           'HEAT' = 'HEAT',
                           'HEAT' = 'RECORD HIGH',
                           'HEAT' = 'RECORD TEMPERATURE',
                           'HEAT' = 'TEMPERATURE RECORD',
                           'HEAT' = 'HYPERTHERMIA',
                           'HEAT' = 'WARM',
                           'HEAT' = 'HOT',
                           'RAIN' = 'PRECIP',
                           'RAIN' = 'RAIN',
                           'RAIN' = 'HEAVY SHOWER',
                           'ICE' = 'ICE',
                           'ICE' = 'ICY',
                           'MUD/ROCK SLIDE' = 'MUD',
                           'MUD/ROCK SLIDE' = 'ROCK SLIDE',
                           'MUD/ROCK SLIDE' = 'LANDSLIDE',
                           'MUD/ROCK SLIDE' = 'LANDSLUMP',
                           'FOG' = 'FOG',
                           'WETNESS' = 'WET',
                           'SUMMARY' = 'SUMMARY')

# Initialize the new column 'EVTYPE2'
repdata$EVTYPE2 <- NA

# Fill the column 'EVTYPE2' with 'EVTYPE' and corresponding table
# 'evtypes_dictionary'
for (i in 1:length(evtypes_dictionary)) {
        repdata[grepl(evtypes_dictionary[i], repdata$EVTYPE), ]$EVTYPE2 <-
                                                names(evtypes_dictionary[i])
}
  1. Finally, The variables ‘PROPDMGEXP’ and ‘CROPDMGEXP’ have the factor of multiplicity of the variables ‘PROPDMG’ and ‘CROPDMG’ with the values:
  • ‘H’: \(10^2\)
  • ‘K’: \(10^3\)
  • ‘M’: \(10^6\)
  • ‘B’: \(10^9\)
  • ‘numeric’: 10 \(^n\)\(^u\)\(^m\)\(^e\)\(^r\)\(^i\)\(^c\)
    We will add two new columns ‘PROPDMGDOLLARS’ and ‘CROPDMGDOLLARS’ which will contain the amounts in actual dollars
# Define a correspondence table in order to map the values in columns
# 'PROPDMGEXP' and 'CROPDMGEXP' with the factors to multiply the values in
# columns 'PROPDMG' and 'CROPDMG'
# To do so, we first look at the values that are used in columns 'PROPDMGEXP'
# and 'CROPDMGEXP':
unique(repdata$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(repdata$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
# Then, define the correspondence table with vectors 'values' and 'factors':
search <- c('H', 'h', 'K', 'k', 'M', 'm', 'B', '', '-', '+', '?', '0', '1',
            '2', '3', '4', '5', '6', '7', '8')
replac <- c(1e2, 1e2, 1e3, 1e3, 1e6, 1e6, 1e9,  1,   1,   1,   1,   1,  10,
            1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8)
# Compute the multiplication vectors for columns 'PROPDMG' and 'CROPDMG' and
# Add 'PROPDMGDOLLARS' and 'CROPDMGDOLLARS' with expected values in dollars:
repdata$PROPDMGDOLLARS <- repdata$PROPDMG *
                          as.numeric(mapvalues(repdata$PROPDMGEXP, search, replac))
repdata$CROPDMGDOLLARS <- repdata$CROPDMG *
                          as.numeric(mapvalues(repdata$CROPDMGEXP, search, replac))

Results

1. Across the United States, which types of events (as indicated in the ‘EVTYPE’ variable) are most harmful with respect to population health?

# sum fatalities and injuries figures per EVTYPE2
fatal_injur_per_type <- repdata[, c('EVTYPE2', 'FATALITIES', 'INJURIES')] %>%
                           group_by(EVTYPE2) %>%
                           summarise_all(funs("sum"))

# add a column with sum of both fatilities and injuries
fatal_injur_per_type$FATALITIES_INJURIES <- apply(fatal_injur_per_type[c('FATALITIES', 'INJURIES')],
                                                  1,
                                                  sum,
                                                  na.rm = TRUE)

# order the dataframe with sum of fatilities per EVTYPE2
fatal_injur_per_type <- fatal_injur_per_type[with(fatal_injur_per_type,
                                                  order(FATALITIES,
                                                        na.last = TRUE,
                                                        decreasing = TRUE)), ]
# plot in a bargraph the 10 most harmful types of events with respect to population health
ggplot(fatal_injur_per_type[1:20, ],
       aes(x = EVTYPE2, y = FATALITIES_INJURIES, fill = FATALITIES_INJURIES)) +
        ggtitle('Number of victims by 20 most harmful weather events') +
        labs(x = 'Weather events, ordered based on the number of deaths',
             y = 'Number of victims (injuries in blue, deaths in dark red)') +
        geom_bar(stat = "identity",
                 width = .7,
                 position = "dodge") +
        scale_fill_gradient(low = "skyblue1",
                            high = "dodgerblue4",
                            name = "Number of victims") +
        geom_bar(data = fatal_injur_per_type[1:20, ],
                 aes(x = EVTYPE2, y = FATALITIES),
                 stat = "identity",
                 position = "dodge",
                 color = NA,
                 fill = "darkred",
                 width = .7) +
        theme(legend.text = element_text(size = 12,
                                         family = "Trebuchet MS",
                                         color = "#666666",
                                         face = "bold",
                                         hjust = 0),
              legend.title = element_text(size = 12,
                                          family = "Trebuchet MS",
                                          color = "#666666",
                                          face = "bold",
                                          hjust = 0),
              axis.text.x = element_text(angle = 45, hjust = 1),
              plot.title = element_text(family = "Trebuchet MS",
                                        color = "#666666",
                                        face = "bold",
                                        size = 18,
                                        hjust = 0),
              axis.title = element_text(family = "Trebuchet MS",
                                        color = "#666666",
                                        face = "bold",
                                        size = 12)) +
        scale_x_discrete(limits = fatal_injur_per_type[1:20,'EVTYPE2'][[1]])

According to the bar chart above, tornadoes are far the most harmful events with respect to the health with almost 100,000 victims since 1950 including more than 5,600 deaths. Are following the heat, with more than 12,400 victims (incl. 3,000+ deaths), Flooding and flash flooding with 10,200+ victims (incl. 1,500+ deaths), wind with 12,600+ victims (incl. 1,200+ deaths) and thunderstorm with 6,000+ victims (incl. 800+ deaths).

2. Across the United States, which types of events have the greatest economic consequences?

# sum property and crop damages figures per EVTYPE2
prop_crop_dmg_per_type <- repdata[, c('EVTYPE2', 'PROPDMGDOLLARS', 'CROPDMGDOLLARS')] %>%
                           group_by(EVTYPE2) %>%
                           summarise_all(funs("sum"))

# add a column with sum of both property and crop damages
prop_crop_dmg_per_type$PROPDMG_CROPDMGDOLLARS <- apply(prop_crop_dmg_per_type[c('PROPDMGDOLLARS', 'CROPDMGDOLLARS')],
                                                  1,
                                                  sum,
                                                  na.rm = TRUE)

# order the dataframe with sum of damagegs (in both property and crop) per EVTYPE2
prop_crop_dmg_per_type <- prop_crop_dmg_per_type[with(prop_crop_dmg_per_type,
                                                  order(PROPDMG_CROPDMGDOLLARS,
                                                        na.last = TRUE,
                                                        decreasing = TRUE)), ]
# plot in a bargraph the 10 weather events with greatest economical impacts
ggplot(prop_crop_dmg_per_type[1:20, ],
       aes(x = EVTYPE2, y = PROPDMG_CROPDMGDOLLARS, fill = PROPDMG_CROPDMGDOLLARS)) +
        ggtitle('Property and crop damages by 20 most harmful weather events') +
        labs(x = 'Weather events, ordered based on the total of property and crop damages',
             y = 'Damages amounts in actual dollar (property in blue, crop in brown)') +
        geom_bar(stat = "identity",
                 width = .7,
                 position = "dodge") +
        scale_fill_gradient(low = "skyblue1",
                            high = "dodgerblue4",
                            name = "Amount in actual dollars") +
        geom_bar(data = prop_crop_dmg_per_type[1:20, ],
                 aes(x = EVTYPE2, y = CROPDMGDOLLARS),
                 stat = "identity",
                 position = "dodge",
                 color = NA,
                 fill = "darkgoldenrod4",
                 width = .7) +
        guides(fill = guide_colourbar(order = 1),
               alpha = guide_legend(order = 2)) +
        theme(legend.text = element_text(size = 12,
                                         family = "Trebuchet MS",
                                         color = "#666666",
                                         face = "bold",
                                         hjust = 0),
              legend.title = element_text(size = 12,
                                          family = "Trebuchet MS",
                                          color = "#666666",
                                          face = "bold",
                                          hjust = 0),
              axis.text.x = element_text(angle = 65, hjust = 1),
              plot.title = element_text(family = "Trebuchet MS",
                                        color = "#666666",
                                        face = "bold",
                                        size = 18,
                                        hjust = 0),
              axis.title = element_text(family = "Trebuchet MS",
                                        color = "#666666",
                                        face = "bold",
                                        size = 10)) +
        scale_x_discrete(limits = prop_crop_dmg_per_type[1:20,'EVTYPE2'][[1]])

According to the bar chart above, flooding (including flash flooding) is the event with greatest economical impact with more than 180 billions of dollars, including 12+ billions of dollars of crop damages, since 1950. Are following the hurricanes with 90 billions of dollars (incl. 5,5 billions dollars for crop), tornadoes with 57+ billions of dollars (incl. 400+ millions dollars for crop damages).