NOAA Storm Database Analysis

Synopsis
Getting the data
Data Processing
- Data cleaning
Results
- Which types of events are most harmful with respect to population health?
- Which types of events have the greatest economic consequences?
Disclaimer

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Most of the variables in the dataset are not needed for this analysis, so I first dropped the ones that are irrelevant to the two questions.

The data span 1950 to 2011; earlier years were recorded less strictly than recent ones, when more precise rules for recording natural events were established. A thorough cleaning step is therefore needed before any analysis can be performed. The main goal is to find the event types that cause the most harm to people or to the economy, so I recoded the event type values to match the NOAA specification as closely as possible.

I calculated the total number of fatalities and injuries for each event type, and in the same way I identified the event types with the greatest economic consequences. I assumed “economic consequences” refers to both property and crop damage, since both affect the economy, and summed the two.

Getting the data

Data can be downloaded from the URL used below; there is no need to decompress the file, since the read_csv function can read bzip2-compressed data directly.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
              "storm_data.csv.bz2")

Data Processing

library(tidyverse)
library(skimr)
library(stringr)
library(magrittr)
library(gridExtra)

Let’s load the data into R and have a quick look.

df <- read_csv("storm_data.csv.bz2")

head(df)

## # A tibble: 6 x 37
##   STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
##     <dbl> <chr>    <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl> <chr>   <chr>      <chr>    <chr>         <dbl>
## 1       1 4/18/19… 0130     CST           97 MOBILE     AL    TORNA…         0 <NA>    <NA>       <NA>     <NA>              0
## 2       1 4/18/19… 0145     CST            3 BALDWIN    AL    TORNA…         0 <NA>    <NA>       <NA>     <NA>              0
## 3       1 2/20/19… 1600     CST           57 FAYETTE    AL    TORNA…         0 <NA>    <NA>       <NA>     <NA>              0
## 4       1 6/8/195… 0900     CST           89 MADISON    AL    TORNA…         0 <NA>    <NA>       <NA>     <NA>              0
## 5       1 11/15/1… 1500     CST           43 CULLMAN    AL    TORNA…         0 <NA>    <NA>       <NA>     <NA>              0
## 6       1 11/15/1… 2000     CST           77 LAUDERDALE AL    TORNA…         0 <NA>    <NA>       <NA>     <NA>              0
## # ... with 23 more variables: COUNTYENDN <chr>, END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>,
## #   F <int>, MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>,
## #   WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## #   REMARKS <chr>, REFNUM <dbl>
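Reading a compressed file without unpacking it first, as done above, can be tried out on a toy file. This is only a minimal sketch: it uses base R's read.csv on a bzfile connection instead of the read_csv call used in this report, and the file contents are made up for illustration.

```r
# Toy round trip: write a tiny CSV through a bzip2 connection, then read it back.
# The temporary file and its contents are invented for illustration.
tmp <- tempfile(fileext = ".csv.bz2")

con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "HAIL,1"), con)
close(con)

# read.csv() accepts a connection, so the data are decompressed on the fly
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
toy

unlink(tmp)  # remove the temporary file
```

The compressed file never has to be expanded on disk, which is the same convenience the report relies on for the full dataset.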

Let’s restrict the data to just relevant variables for our research.

sub_df <- df %>% 
    select(-c(STATE__, BGN_DATE, BGN_TIME, TIME_ZONE, COUNTY, COUNTYNAME, STATE, BGN_RANGE, BGN_AZI, BGN_LOCATI, 
              END_DATE, END_TIME, COUNTY_END, COUNTYENDN, END_RANGE, END_AZI, END_LOCATI, LENGTH, WIDTH, 
              STATEOFFIC, ZONENAMES, LATITUDE, LONGITUDE, LATITUDE_E, LONGITUDE_, REMARKS))
head(sub_df)

## # A tibble: 6 x 11
##   EVTYPE      F   MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO   REFNUM
##   <chr>   <int> <dbl>      <dbl>    <dbl>   <dbl> <chr>        <dbl> <chr>      <chr>  <dbl>
## 1 TORNADO     3     0          0       15    25   K                0 <NA>       <NA>       1
## 2 TORNADO     2     0          0        0     2.5 K                0 <NA>       <NA>       2
## 3 TORNADO     2     0          0        2    25   K                0 <NA>       <NA>       3
## 4 TORNADO     2     0          0        2     2.5 K                0 <NA>       <NA>       4
## 5 TORNADO     2     0          0        2     2.5 K                0 <NA>       <NA>       5
## 6 TORNADO     2     0          0        6     2.5 K                0 <NA>       <NA>       6

Now let’s check for missing values and other issues with these data.

skim(sub_df)

## Skim summary statistics
##  n obs: 902297 
##  n variables: 11 
## 
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────
##    variable missing complete      n min max empty n_unique
##  CROPDMGEXP  618413   283884 902297   1   1     0        8
##      EVTYPE       0   902297 902297   1  30     0      977
##  PROPDMGEXP  465934   436363 902297   1   1     0       18
##         WFO  142069   760228 902297   1   3     0      541
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n mean sd p0 p25 p50 p75 p100     hist
##         F  843563    58734 902297 0.91  1  0   0   1   1    5 ▇▆▁▃▁▁▁▁
## 
## ── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────────────────
##    variable missing complete      n       mean        sd p0    p25    p50      p75  p100     hist
##     CROPDMG       0   902297 902297      1.53      22.17  0      0      0      0     990 ▇▁▁▁▁▁▁▁
##  FATALITIES       0   902297 902297      0.017      0.77  0      0      0      0     583 ▇▁▁▁▁▁▁▁
##    INJURIES       0   902297 902297      0.16       5.43  0      0      0      0    1700 ▇▁▁▁▁▁▁▁
##         MAG       0   902297 902297     46.9       61.91  0      0     50     75   22000 ▇▁▁▁▁▁▁▁
##     PROPDMG       0   902297 902297     12.06      59.48  0      0      0      0.5  5000 ▇▁▁▁▁▁▁▁
##      REFNUM       0   902297 902297 451149     260470.85  1 225575 451149 676723   9e+05 ▇▇▇▇▇▇▇▇
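The two figures that matter most in this summary, the number of missing values and the number of distinct values per column, can also be obtained without skimr. The sketch below uses base R on an invented toy data frame; it is an alternative illustration, not the method used in this report.

```r
# Toy data frame; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 0, 2.5, 0),
  PROPDMGEXP = c("K", NA, "M", NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column
n_missing <- colSums(is.na(toy))

# Number of distinct non-missing values per column
n_unique <- vapply(toy, function(col) length(unique(col[!is.na(col)])), integer(1))

data.frame(missing = n_missing, n_unique = n_unique)
```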

Something interesting shows up here: the EVTYPE variable has 977 unique values, although the data documentation lists only 48 event types.
This suggests a data-quality problem, so let’s have a look at some of these values.

sub_df %>% 
    count(EVTYPE) %>% 
    arrange(desc(n))

## # A tibble: 977 x 2
##    EVTYPE                  n
##    <chr>               <int>
##  1 HAIL               288661
##  2 TSTM WIND          219944
##  3 THUNDERSTORM WIND   82563
##  4 TORNADO             60652
##  5 FLASH FLOOD         54278
##  6 FLOOD               25326
##  7 THUNDERSTORM WINDS  20843
##  8 HIGH WIND           20212
##  9 LIGHTNING           15755
## 10 HEAVY SNOW          15708
## # ... with 967 more rows

As we can see, a number of values are misspelled or inconsistently encoded (for example, TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS all describe the same event type), so some cleaning is needed.
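Part of this inflation comes from purely cosmetic differences such as case and spacing. As a rough sketch, a small normalising helper (the name normalise is hypothetical, and the lower-case and padded strings are invented examples) shows how far simple text clean-up gets before manual recoding is needed.

```r
# The first three spellings appear in this report's own recoding list;
# the mixed-case and padded ones are hypothetical examples.
raw <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM  WINDS",
         "Thunderstorm Winds", " HAIL", "HAIL")

normalise <- function(x) {
  x <- toupper(trimws(x))   # same case, no leading or trailing blanks
  gsub("\\s+", " ", x)      # collapse runs of whitespace into one space
}

length(unique(raw))             # distinct raw strings
length(unique(normalise(raw)))  # distinct strings after normalisation
```

Normalisation merges the spacing and case variants, but TSTM WIND and THUNDERSTORM WINDS still differ, which is why an explicit mapping is still required.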

Data cleaning

A closer look at the EVTYPE variable shows far too many distinct values to fix exhaustively without a very complicated strategy. Instead, I selected the 200 most common values and recoded them to match the original 48 event types; values outside this list are left out of the analysis, and top-200 values that do not match any official type are labelled OTHER.

top_200 <- sub_df %>% 
    count(EVTYPE) %>% 
    arrange(desc(n)) %>% 
    top_n(200, n)

top_df <- sub_df %>% 
    filter(EVTYPE %in% top_200$EVTYPE) %>% 
    mutate(EVTYPE = toupper(EVTYPE))
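Whether a cut-off such as 200 is reasonable depends on how many records the most frequent types cover. In this report the filtered data keep 900,908 of the 902,297 records, about 99.8%. The sketch below shows how such a coverage figure could be computed; the counts are invented and the 99% threshold is an arbitrary assumption.

```r
# Toy event counts; the numbers are invented for illustration.
counts <- c(HAIL = 500, `TSTM WIND` = 300, TORNADO = 150,
            `HAIL 75` = 30, FUNNEL = 15, `RARE TYPE` = 5)

counts <- sort(counts, decreasing = TRUE)
coverage <- cumsum(counts) / sum(counts)  # share of records kept by the top k types

k <- 3
coverage[k]                               # share covered by the k most frequent types

# Smallest k that keeps at least 99% of the records
min(which(coverage >= 0.99))
```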

Now I will perform a (quite boring) replacement of the messy event types with their proper counterparts from the data documentation.

top_df %<>% 
    mutate(event = case_when(EVTYPE %in% c("THUNDERSTORM WIND", "TSTM WIND", "THUNDERSTORM WINDS", "TSTM WIND/HAIL", 
                                           "THUNDERSTORM WINDS HAIL", "THUNDERSTORM WINDSS", "THUNDERSTORM", 
                                           "TSTM WIND (G45)", "THUNDERSTORM WINDS/HAIL", "SEVERE THUNDERSTORMS", 
                                           "THUNDERSTORMS WINDS", "SEVERE THUNDERSTORM", "TSTM WIND (G40)", 
                                           "THUNDERSTORM  WINDS") ~ "THUNDERSTORM WIND", 
                             EVTYPE %in% c("MARINE THUNDERSTORM WIND", "MARINE TSTM WIND", 
                                           "COASTAL STORM") ~ "MARINE THUNDERSTORM WIND", 
                             EVTYPE %in% c("FLOOD", "URBAN/SML STREAM FLD", "FLOOD/FLASH FLOOD", "URBAN FLOOD", 
                                           "RIVER FLOOD", "FLOODING", "URBAN FLOODING", "URBAN/SMALL STREAM FLOOD", 
                                           "RIVER FLOODING", "SMALL STREAM FLOOD") ~ "FLOOD", 
                             EVTYPE %in% c("HIGH WIND", "HIGH WINDS", "WIND", "DRY MICROBURST", "WINDS", 
                                           "WIND ADVISORY") ~ "HIGH WIND", 
                             EVTYPE %in% c("WILDFIRE", "WILD/FOREST FIRE", "WILDFIRES") ~ "WILDFIRE", 
                             EVTYPE %in% c("WINTER WEATHER", "WINTER WEATHER/MIX", 
                                           "LOW TEMPERATURE") ~ "WINTER WEATHER", 
                             EVTYPE %in% c("FLASH FLOOD", "FLASH FLOODING", "FLASH FLOODS", 
                                           "FLASH FLOOD/FLOOD", "FLASH FLOODING/FLOOD") ~ "FLASH FLOOD", 
                             EVTYPE %in% c("EXTREME COLD/WIND CHILL", "EXTREME COLD", "EXTREME WINDCHILL", 
                                           "RECORD COLD", "UNSEASONABLY COLD", 
                                           "EXTREME WINDCHILL TEMPERATURES") ~ "EXTREME COLD/WIND CHILL", 
                             EVTYPE %in% c("AVALANCHE", "LANDSLIDE", "MUDSLIDE", "LANDSLIDES", 
                                           "MUD SLIDE") ~ "AVALANCHE", 
                             EVTYPE %in% c("HEAVY SNOW", "SNOW", "HEAVY SNOW SQUALLS", "EXCESSIVE SNOW", 
                                           "SNOW SQUALL", "SNOW SQUALLS", "HEAVY SNOW-SQUALLS", "BLOWING SNOW", 
                                           "RECORD SNOW", "SNOW/BLOWING SNOW", "SNOW/ICE") ~ "HEAVY SNOW", 
                             EVTYPE %in% c("DENSE FOG", "FOG", "GLAZE") ~ "DENSE FOG", 
                             EVTYPE %in% c("RIP CURRENT", "RIP CURRENTS") ~ "RIP CURRENT", 
                             EVTYPE %in% c("STORM SURGE/TIDE", "STORM SURGE", "HIGH SEAS") ~ "STORM SURGE/TIDE", 
                             EVTYPE %in% c("HAIL", "FREEZING RAIN", "SMALL HAIL", "HAIL 75", "HAIL 0.75", 
                                           "HAIL 100", "HAIL 175", "SNOW FREEZING RAIN", 
                                           "NON SEVERE HAIL") ~ "HAIL", 
                             EVTYPE %in% c("HIGH SURF", "HEAVY SURF/HIGH SURF", "HEAVY SURF") ~ "HIGH SURF", 
                             EVTYPE %in% c("STRONG WIND", "STRONG WINDS", "GUSTY WINDS", "WIND DAMAGE", 
                                           "GUSTY WIND", "GRADIENT WINDS") ~ "STRONG WIND", 
                             EVTYPE %in% c("HURRICANE (TYPHOON)", "HURRICANE", "HURRICANE/TYPHOON", 
                                           "TYPHOON", "HURRICANE OPAL", 
                                           "HURRICANE ERIN") ~ "HURRICANE (TYPHOON)", 
                             EVTYPE %in% c("SLEET", "LIGHT SNOW", "MODERATE SNOWFALL", "WINTRY MIX", 
                                           "LIGHT FREEZING RAIN", "FREEZING DRIZZLE", "SLEET STORM", 
                                           "SNOW/SLEET", "FREEZING RAIN/SLEET") ~ "SLEET", 
                             EVTYPE %in% c("HEAT", "RECORD WARMTH", "UNSEASONABLY HOT", "UNUSUAL WARMTH", "DRY", 
                                           "UNSEASONABLY WARM", "UNSEASONABLY DRY", 
                                           "UNSEASONABLY WARM AND DRY") ~ "HEAT", 
                             EVTYPE %in% c("COASTAL FLOOD", "COASTAL FLOODING", "TIDAL FLOODING") ~ "COASTAL FLOOD", 
                             EVTYPE %in% c("EXCESSIVE HEAT", "RECORD HEAT", "HEAT WAVE", 
                                           "EXTREME HEAT") ~ "EXCESSIVE HEAT", 
                             # not very sure about the following one, but since in the data 
                             # documentation only an Astronomical Low Tide entry 
                             # is reported, I'm assuming this conversion is appropriate
                             EVTYPE %in% c("ASTRONOMICAL LOW TIDE", 
                                           "ASTRONOMICAL HIGH TIDE") ~ "ASTRONOMICAL LOW TIDE", 
                             EVTYPE %in% c("FUNNEL CLOUD", "FUNNEL CLOUDS", "FUNNEL") ~ "FUNNEL CLOUD", 
                             EVTYPE %in% c("FROST/FREEZE", "FREEZE", "ICE", "FROST", "ICY ROADS", "BLACK ICE", 
                                           "HARD FREEZE") ~ "FROST/FREEZE", 
                             EVTYPE %in% c("COLD/WIND CHILL", "COLD", "WIND CHILL", "PROLONG COLD", 
                                           "UNSEASONABLY COOL", "UNUSUALLY COLD") ~ "COLD/WIND CHILL", 
                             EVTYPE %in% c("WATERSPOUT", "WATERSPOUTS", "WATERSPOUT-", 
                                           "WATERSPOUT/TORNADO") ~ "WATERSPOUT", 
                             EVTYPE %in% c("HEAVY RAIN", "HEAVY RAINS", "UNSEASONABLY WET", 
                                           "RAIN", "RECORD RAINFALL", "HEAVY RAINS/FLOODING") ~ "HEAVY RAIN", 
                             EVTYPE %in% c("ICE STORM", "SNOW AND ICE", "SNOW/ICE STORM") ~ "ICE STORM", 
                             EVTYPE %in% c("LAKE-EFFECT SNOW", "HEAVY LAKE SNOW", 
                                           "LAKE EFFECT SNOW") ~ "LAKE-EFFECT SNOW", 
                             EVTYPE %in% c("TORNADO", "TORNADO F0") ~ "TORNADO", 
                             EVTYPE %in% c("DROUGHT", "DROUGHT/EXCESSIVE HEAT", "SNOW DROUGHT") ~ "DROUGHT", 
                             EVTYPE %in% c("DENSE SMOKE", "SMOKE") ~ "DENSE SMOKE", 
                             EVTYPE %in% c("LIGHTNING", "THUNDERSTORM WINDS LIGHTNING") ~ "LIGHTNING", 
                             EVTYPE == "BLIZZARD" ~ "BLIZZARD", 
                             EVTYPE == "DEBRIS FLOW" ~ "DEBRIS FLOW", 
                             EVTYPE == "DUST DEVIL" ~ "DUST DEVIL", 
                             EVTYPE == "DUST STORM" ~ "DUST STORM", 
                             EVTYPE == "FREEZING FOG" ~ "FREEZING FOG", 
                             EVTYPE == "LAKESHORE FLOOD" ~ "LAKESHORE FLOOD", 
                             EVTYPE == "MARINE HAIL" ~ "MARINE HAIL", 
                             EVTYPE == "MARINE HIGH WIND" ~ "MARINE HIGH WIND", 
                             EVTYPE == "MARINE STRONG WIND" ~ "MARINE STRONG WIND", 
                             EVTYPE == "SEICHE" ~ "SEICHE", 
                             EVTYPE == "TROPICAL DEPRESSION" ~ "TROPICAL DEPRESSION", 
                             EVTYPE == "TROPICAL STORM" ~ "TROPICAL STORM", 
                             EVTYPE == "TSUNAMI" ~ "TSUNAMI", 
                             EVTYPE == "VOLCANIC ASH" ~ "VOLCANIC ASH", 
                             EVTYPE == "WINTER STORM" ~ "WINTER STORM", 
                             TRUE ~ "OTHER"))

Although this shortcut is crude and may bias the results, it should be enough to get an idea of the data (and to show how to achieve the main aims of the analysis).
The clean dataset now looks like this, and is ready to help us with our questions.

head(top_df)

## # A tibble: 6 x 12
##   EVTYPE      F   MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO   REFNUM event  
##   <chr>   <int> <dbl>      <dbl>    <dbl>   <dbl> <chr>        <dbl> <chr>      <chr>  <dbl> <chr>  
## 1 TORNADO     3     0          0       15    25   K                0 <NA>       <NA>       1 TORNADO
## 2 TORNADO     2     0          0        0     2.5 K                0 <NA>       <NA>       2 TORNADO
## 3 TORNADO     2     0          0        2    25   K                0 <NA>       <NA>       3 TORNADO
## 4 TORNADO     2     0          0        2     2.5 K                0 <NA>       <NA>       4 TORNADO
## 5 TORNADO     2     0          0        2     2.5 K                0 <NA>       <NA>       5 TORNADO
## 6 TORNADO     2     0          0        6     2.5 K                0 <NA>       <NA>       6 TORNADO
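The long case_when recoding used earlier could alternatively be written as a lookup table, which keeps the mapping as data rather than code. The following is only a sketch of that idea in base R: it reproduces a handful of the report's mappings, and the helper name recode_event is hypothetical.

```r
# Lookup table: names are raw EVTYPE values, values are the official type.
# Only a few of the report's mappings are reproduced here.
lookup <- c(
  "TSTM WIND"          = "THUNDERSTORM WIND",
  "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
  "THUNDERSTORM WIND"  = "THUNDERSTORM WIND",
  "RIP CURRENTS"       = "RIP CURRENT",
  "RIP CURRENT"        = "RIP CURRENT",
  "TORNADO F0"         = "TORNADO",
  "TORNADO"            = "TORNADO"
)

recode_event <- function(evtype) {
  event <- unname(lookup[toupper(evtype)])  # NA where no mapping exists
  event[is.na(event)] <- "OTHER"            # same fallback as TRUE ~ "OTHER"
  event
}

recoded <- recode_event(c("TSTM WIND", "rip currents", "TORNADO F0", "NOT A REAL TYPE"))
recoded
```

A table like this could also be stored in a CSV file and reviewed separately from the analysis code.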

Results

Which types of events are most harmful with respect to population health?

In order to find the answer to this question, we want to consider both the FATALITIES and INJURIES variables.

top_fatal <- top_df %>% 
    group_by(event) %>% 
    summarise(tot_fatal = sum(FATALITIES)) %>% 
    arrange(desc(tot_fatal)) %>%
    top_n(5, tot_fatal) 
top_fatal

## # A tibble: 5 x 2
##   event          tot_fatal
##   <chr>              <dbl>
## 1 TORNADO             5633
## 2 EXCESSIVE HEAT      2173
## 3 FLASH FLOOD         1018
## 4 HEAT                 977
## 5 LIGHTNING            816

top_injur <- top_df %>% 
    group_by(event) %>% 
    summarise(tot_injur = sum(INJURIES)) %>% 
    arrange(desc(tot_injur)) %>% 
    top_n(5, tot_injur) 
top_injur

## # A tibble: 5 x 2
##   event             tot_injur
##   <chr>                 <dbl>
## 1 TORNADO               91346
## 2 THUNDERSTORM WIND      9480
## 3 EXCESSIVE HEAT         7039
## 4 FLOOD                  6887
## 5 LIGHTNING              5230

fatal_plot <- top_fatal %>% 
    ggplot(aes(x = reorder(event, tot_fatal), y = tot_fatal, fill = event)) + 
    geom_col() + 
    coord_flip() + 
    scale_fill_brewer(palette = "Set2") + 
    guides(fill = FALSE) + 
    theme_bw() + 
    labs(x = "Event", y = "Fatalities", title = "Top 5 events by fatalities", 
         subtitle = "Tornadoes account for the highest number of fatalities.")
injur_plot <- top_injur %>% 
    ggplot(aes(x = reorder(event, tot_injur), y = tot_injur, fill = event)) + 
    geom_col() + 
    coord_flip() + 
    scale_fill_brewer(palette = "Paired") + 
    guides(fill = FALSE) + 
    theme_bw() + 
    labs(x = "Event", y = "Injuries", title = "Top 5 events by injuries", 
         subtitle = "Tornadoes account for the highest number of injuries.")

grid.arrange(fatal_plot, injur_plot, nrow = 2)

From this analysis, we can see that tornadoes are the most harmful events in terms of population health: they rank first for both fatalities (5633) and injuries (91346).

The plots also show how far tornadoes exceed every other event type; the gap is particularly large for injuries, where tornadoes account for almost ten times as many as the second-ranked type.
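Fatalities and injuries are ranked separately above. To see both measures side by side, they could be summed per event type in a single table. The sketch below does this with base R's aggregate on invented toy records; it is an illustration, not part of the original analysis.

```r
# Toy records; the counts are invented for illustration.
toy <- data.frame(
  event      = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(3, 1, 2, 0, 4),
  INJURIES   = c(40, 10, 5, 2, 0),
  stringsAsFactors = FALSE
)

# Sum both health outcomes per event type in one pass
health <- aggregate(cbind(FATALITIES, INJURIES) ~ event, data = toy, FUN = sum)

# Rank by fatalities, breaking ties by injuries
health <- health[order(-health$FATALITIES, -health$INJURIES), ]
health
```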

Which types of events have the greatest economic consequences?

In order to answer this question, we have to look at the PROPDMG and CROPDMG variables. Note that each of them has a companion variable, PROPDMGEXP and CROPDMGEXP respectively, which specifies the magnitude of the given value. From the data documentation:

[Damage] Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
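The documentation's own example can be worked through in a couple of lines. This is a minimal sketch; the vector name multiplier is an assumption introduced here for illustration.

```r
# Multipliers for the three documented magnitude codes
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# The documentation's example: 1.55B
amount <- 1.55 * multiplier[["B"]]
format(amount, big.mark = ",", scientific = FALSE)
```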

Let’s check that everything’s alright with the PROPDMGEXP and CROPDMGEXP variables.

top_df %>% 
    count(PROPDMGEXP)

## # A tibble: 18 x 2
##    PROPDMGEXP      n
##    <chr>       <int>
##  1 -               1
##  2 ?               8
##  3 +               3
##  4 0             215
##  5 1              25
##  6 2              13
##  7 3               4
##  8 4               4
##  9 5              27
## 10 6               4
## 11 7               5
## 12 8               1
## 13 B              37
## 14 H               6
## 15 K          424390
## 16 m               7
## 17 M           11269
## 18 <NA>       464889

top_df %>% 
    count(CROPDMGEXP)

## # A tibble: 9 x 2
##   CROPDMGEXP      n
##   <chr>       <int>
## 1 ?               6
## 2 0              18
## 3 2               1
## 4 B               9
## 5 k              21
## 6 K          281783
## 7 m               1
## 8 M            1963
## 9 <NA>       617106

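The documentation does not explain the other codes, such as H, +, -, ? or the digits, so I will treat every value other than M and B (including missing values) as thousands (i.e. K), since this is the lowest documented unit. Lower-case k, m and b are handled by converting the codes to upper case.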

We should first convert each pair of variables into prop_dmg and crop_dmg variables holding the actual dollar values; after that, since the question does not focus on one particular type of damage, I assume that both contribute equally to the economic consequences, so I will sum them to obtain a total_dmg column.

econ_df <- top_df %>% 
    mutate(prop_dmg = case_when(toupper(PROPDMGEXP) == "M" ~ PROPDMG * 1000000, 
                                toupper(PROPDMGEXP) == "B" ~ PROPDMG * 1000000000, 
                                TRUE ~ PROPDMG * 1000), 
           crop_dmg = case_when(toupper(CROPDMGEXP) == "M" ~ CROPDMG * 1000000, 
                                toupper(CROPDMGEXP) == "B" ~ CROPDMG * 1000000000, 
                                TRUE ~ CROPDMG * 1000), 
           total_dmg = prop_dmg + crop_dmg) %>% 
    select(event, prop_dmg, crop_dmg, total_dmg)
econ_df

## # A tibble: 900,908 x 4
##    event   prop_dmg crop_dmg total_dmg
##    <chr>      <dbl>    <dbl>     <dbl>
##  1 TORNADO    25000        0     25000
##  2 TORNADO     2500        0      2500
##  3 TORNADO    25000        0     25000
##  4 TORNADO     2500        0      2500
##  5 TORNADO     2500        0      2500
##  6 TORNADO     2500        0      2500
##  7 TORNADO     2500        0      2500
##  8 TORNADO     2500        0      2500
##  9 TORNADO    25000        0     25000
## 10 TORNADO    25000        0     25000
## # ... with 900,898 more rows
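Because every code other than M and B is treated as thousands, it may be worth checking how much of the total rests on that assumption. The sketch below does so on invented toy records using base R; the rule mirrors the one applied above, but the data and variable names are illustrative only.

```r
# Toy records; values are invented, but the codes mimic those found in the data.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 5, 1),
  PROPDMGEXP = c("K", "m", "B", "5", NA),
  stringsAsFactors = FALSE
)

code <- toupper(toy$PROPDMGEXP)
documented <- code %in% c("K", "M", "B")  # %in% gives FALSE for NA, never NA

# Rule used in this report: anything that is not M or B counts as thousands
mult <- ifelse(code %in% "M", 1e6, ifelse(code %in% "B", 1e9, 1e3))
toy$prop_dmg <- toy$PROPDMG * mult

# How many records, and what share of the damage, rest on the fallback?
sum(!documented)
sum(toy$prop_dmg[!documented]) / sum(toy$prop_dmg)
```

If that share is small, the choice of fallback unit has little effect on the ranking of event types.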

Total economic damage

Now we can finally look for the event types that cause the most economic damage.

top_damage <- econ_df %>% 
    group_by(event) %>% 
    summarise(event_dmg = sum(total_dmg)) %>% 
    arrange(desc(event_dmg)) %>% 
    top_n(10, event_dmg) 
top_damage

## # A tibble: 10 x 2
##    event                  event_dmg
##    <chr>                      <dbl>
##  1 FLOOD               160956469850
##  2 HURRICANE (TYPHOON)  90710952810
##  3 TORNADO              57352655690
##  4 STORM SURGE/TIDE     47965594500
##  5 HAIL                 18789023370
##  6 FLASH FLOOD          18169283860
##  7 DROUGHT              15018677780
##  8 THUNDERSTORM WIND    12217487730
##  9 ICE STORM             8967191360
## 10 TROPICAL STORM        8382236550

top_damage %>% 
    ggplot(aes(x = reorder(event, event_dmg), y = event_dmg, fill = event)) + 
    geom_col() + 
    coord_flip() + 
    scale_fill_brewer(palette = "Set3") + 
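    # explicit breaks keep the hard-coded labels aligned with the axis
    scale_y_continuous(breaks = c(0, 50e9, 100e9, 150e9), 
                       labels = c("0", "50B", "100B", "150B")) +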
    guides(fill = FALSE) + 
    theme_bw() + 
    labs(x = "Event", y = "Estimated damage ($)", title = "Top 10 events by economic damage", 
         subtitle = "Floods are responsible for the greatest economic consequences.")

From our analysis, floods had the greatest economic consequences, causing about $161 billion in combined property and crop damage, followed by hurricanes (about $91 billion) and tornadoes (about $57 billion).
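Raw totals with twelve digits are hard to read, so a small helper that expresses amounts in billions can be handy. The function below is a sketch and its name, label_billions, is hypothetical.

```r
# Helper that expresses dollar amounts in billions, rounded to one decimal.
label_billions <- function(x) {
  paste0(round(x / 1e9, 1), "B")
}

# Total flood damage taken from the table above
flood_label <- label_billions(160956469850)
flood_label

# Vectorised, so it can label several values at once
axis_labels <- label_billions(c(0, 5e10, 1e11, 1.5e11))
axis_labels
```

Since the labels argument of scale_y_continuous also accepts a function, a helper like this could stand in for a hard-coded vector of labels.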

Properties vs crops

If we’re curious about separating property damage from crop damage, we can easily split our analysis.

top_split_damage <- econ_df %>% 
    group_by(event) %>% 
    summarise(tot_prop = sum(prop_dmg), 
              tot_crop = sum(crop_dmg)) 
prop_plot <- top_split_damage %>% 
    arrange(desc(tot_prop)) %>% 
    top_n(5, tot_prop) %>% 
    ggplot(aes(x = reorder(event, tot_prop), y = tot_prop, fill = event)) + 
    geom_col() + 
    coord_flip() +
    scale_fill_brewer(palette = "Set2") + 
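    # explicit breaks keep the hard-coded labels aligned with the axis
    scale_y_continuous(breaks = c(0, 50e9, 100e9, 150e9), 
                       labels = c("0", "50B", "100B", "150B")) +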
    guides(fill = FALSE) + 
    theme_bw() + 
    labs(x = "Event", y = "Estimated damage ($)", title = "Top 5 events by property damage", 
         subtitle = "Floods account for the highest amount of damage caused to properties.")
crop_plot <- top_split_damage %>% 
    arrange(desc(tot_crop)) %>% 
    top_n(5, tot_crop) %>% 
    ggplot(aes(x = reorder(event, tot_crop), y = tot_crop, fill = event)) + 
    geom_col() + 
    coord_flip() +
    scale_fill_brewer(palette = "Paired") + 
    # explicit breaks keep the hard-coded labels aligned with the axis
    scale_y_continuous(breaks = c(0, 5e9, 10e9, 15e9), 
                       labels = c("0", "5B", "10B", "15B")) +
    guides(fill = FALSE) + 
    theme_bw() + 
    labs(x = "Event", y = "Estimated damage ($)", title = "Top 5 events by crop damage", 
         subtitle = "Droughts account for the highest amount of damage caused to crops.")
grid.arrange(prop_plot, crop_plot, nrow = 2)

Interestingly, droughts cause the most damage to crops, although the amount is roughly ten times lower than the property damage caused by floods.

Disclaimer

This data analysis was performed as an assignment for the Reproducible Research course, which is part of the Data Science Specialization on Coursera. It is meant to be a toy analysis for practising the key concepts of reproducible data analysis.
No conclusions should be drawn from what is reported here.

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  gridExtra_2.3   magrittr_1.5    skimr_1.0.3     forcats_0.3.0   stringr_1.3.1   dplyr_0.7.6    
##  [8] purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2    ggplot2_3.0.0   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4   haven_1.1.2        lattice_0.20-35    colorspace_1.3-2   htmltools_0.3.6    yaml_2.2.0        
##  [7] utf8_1.1.4         rlang_0.2.2        pillar_1.3.0       glue_1.3.0         withr_2.1.2        RColorBrewer_1.1-2
## [13] modelr_0.1.2       readxl_1.1.0       bindr_0.1.1        plyr_1.8.4         munsell_0.5.0      gtable_0.2.0      
## [19] cellranger_1.1.0   rvest_0.3.2        evaluate_0.11      labeling_0.3       knitr_1.20         fansi_0.3.0       
## [25] broom_0.5.0        Rcpp_0.12.18       scales_1.0.0       backports_1.1.2    jsonlite_1.5       hms_0.4.2         
## [31] digest_0.6.17      stringi_1.2.4      grid_3.5.1         rprojroot_1.3-2    cli_1.0.0          tools_3.5.1       
## [37] lazyeval_0.2.1     crayon_1.3.4       pkgconfig_2.0.2    xml2_1.2.0         lubridate_1.7.4    assertthat_0.2.0  
## [43] rmarkdown_1.10     httr_1.3.1         rstudioapi_0.7     R6_2.2.2           nlme_3.1-137       compiler_3.5.1