Synopsis

The basic goal of this report is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and analyse the damage caused by certain types of severe weather events.

This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This report focuses on answering two main questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?

As we are investigating only the damage done to people and property, only a small subset of the collected data will be relevant for us.

Data Processing

library("dplyr")
library("ggplot2")

First of all we need the data itself. If the .bz2 archive is not yet available, we try to download it from the specified URL.

input_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
input_archive <- "StormData.csv.bz2"

if (!file.exists(input_archive)) {
    message("Downloading '", input_url, "'")
    download.file(input_url, input_archive)
}

Loading the data

Once we have the archive, we can load its content. As there are more than 900k records, loading them takes a considerable time, so we can speed up subsequent runs by caching it in an .rds file.

We will need only 7 columns of the 37, so the memory usage can be greatly reduced if we read only the needed columns:

  • “EVTYPE”
    The type of the weather event, eg. “TORNADO”, “WILDFIRE”, “ASTRONOMICAL HIGH TIDE”
  • “FATALITIES” and “INJURIES”
    The number of people affected by the event
  • “PROPDMG” and “PROPDMGEXP”
    The property damage caused by the event, the latter one encoding the exponent with letters ‘K’, ‘M’, and ‘B’ for
    thousands, millions and billions.
  • “CROPDMG” and “CROPDMGEXP”
    The damage done to crops, represented in the same way.

For more details please refer to the Raw Data Documentation.

input_rds <- "StormData.rds"

if (file.exists(input_rds)) {
    storm <- readRDS(input_rds)
} else {
    message("Reading '", input_archive, "', please wait")
    storm <- tibble::as_tibble(read.csv(input_archive,
                                        na.strings = c("", "0.00"),
                                        # read only those columns we actually need
                                        colClasses = c("NULL",      # "STATE__"
                                                       "NULL",      # "BGN_DATE"
                                                       "NULL",      # "BGN_TIME"
                                                       "NULL",      # "TIME_ZONE"
                                                       "NULL",      # "COUNTY"
                                                       "NULL",      # "COUNTYNAME"
                                                       "NULL",      # "STATE"
                                                       "factor",    # "EVTYPE"
                                                       "NULL",      # "BGN_RANGE"
                                                       "NULL",      # "BGN_AZI"
                                                       "NULL",      # "BGN_LOCATI"
                                                       "NULL",      # "END_DATE"
                                                       "NULL",      # "END_TIME"
                                                       "NULL",      # "COUNTY_END"
                                                       "NULL",      # "COUNTYENDN"
                                                       "NULL",      # "END_RANGE"
                                                       "NULL",      # "END_AZI"
                                                       "NULL",      # "END_LOCATI"
                                                       "NULL",      # "LENGTH"
                                                       "NULL",      # "WIDTH"
                                                       "NULL",      # "F"
                                                       "NULL",      # "MAG"
                                                       "numeric",   # "FATALITIES"
                                                       "numeric",   # "INJURIES"
                                                       "numeric",   # "PROPDMG"
                                                       "character", # "PROPDMGEXP"
                                                       "numeric",   # "CROPDMG"
                                                       "character", # "CROPDMGEXP"
                                                       "NULL",      # "WFO"
                                                       "NULL",      # "STATEOFFIC"
                                                       "NULL",      # "ZONENAMES"
                                                       "NULL",      # "LATITUDE"
                                                       "NULL",      # "LONGITUDE"
                                                       "NULL",      # "LATITUDE_E"
                                                       "NULL",      # "LONGITUDE_"
                                                       "NULL",      # "REMARKS"
                                                       "NULL")      # "REFNUM"
                                        )) %>%
             mutate(FATALITIES = as.integer(FATALITIES),
                    INJURIES   = as.integer(INJURIES),
                    PROPDMGEXP = as.factor(toupper(PROPDMGEXP)),
                    CROPDMGEXP = as.factor(toupper(CROPDMGEXP)))

    message("Saving cache '", input_rds, "'")
    saveRDS(storm, input_rds)
}

summary(storm)
##                EVTYPE         FATALITIES        INJURIES     
##  HAIL             :288661   Min.   :  1.0    Min.   :   1    
##  TSTM WIND        :219940   1st Qu.:  1.0    1st Qu.:   1    
##  THUNDERSTORM WIND: 82563   Median :  1.0    Median :   2    
##  TORNADO          : 60652   Mean   :  2.2    Mean   :   8    
##  FLASH FLOOD      : 54277   3rd Qu.:  2.0    3rd Qu.:   4    
##  FLOOD            : 25326   Max.   :583.0    Max.   :1700    
##  (Other)          :170878   NA's   :895323   NA's   :884693  
##     PROPDMG         PROPDMGEXP        CROPDMG       CROPDMGEXP   
##  Min.   :   0.0   K      :424665   Min.   :  0.0    ?   :     7  
##  1st Qu.:   2.5   M      : 11337   1st Qu.:  5.0    0   :    19  
##  Median :  10.0   0      :   216   Median : 10.0    2   :     1  
##  Mean   :  45.5   B      :    40   Mean   : 62.3    B   :     9  
##  3rd Qu.:  25.0   5      :    28   3rd Qu.: 50.0    K   :281853  
##  Max.   :5000.0   (Other):    77   Max.   :990.0    M   :  1995  
##  NA's   :663123   NA's   :465934   NA's   :880198   NA's:618413

Checking data validity

According to the Documentation, PROPDMGEXP and CROPDMGEXP may take only “K”, “M” or “B”, but it seems that there are about 300 cases when PROPDMGEXP has some other value, and 27 similar for CROPDMGEXP.

To figure out how to handle these cases, let’s take a look at their characteristic values:

summary(storm$PROPDMG[!is.na(storm$PROPDMGEXP) & !(storm$PROPDMGEXP %in% c("K", "M", "B"))])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.10   10.00   25.00   30.56   50.00  150.00      70

We may compare them to the ones of the known exponents, perhaps they represent only mistyped “K”/“M”/“B” values:

summary(storm$PROPDMG[is.na(storm$PROPDMGEXP)])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.4     3.0     5.0     6.9     7.2    75.0  465858
summary(storm$PROPDMG[storm$PROPDMGEXP == "K"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     2.5    10.0    47.2    30.0  5000.0  663118
summary(storm$PROPDMG[storm$PROPDMGEXP == "M"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     1.5     2.5    12.4     6.0   929.0  465945
summary(storm$PROPDMG[storm$PROPDMGEXP == "B"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.1     1.4     2.3     6.9     5.0   115.0  465934

They aren’t really similar to any of the four valid categories.

On the other hand, they represent only 1/3000 of the records, so we might even just discard them, but there is a slightly better option:

If we just omitted them, then their whole value would be lost from the totals.

If we treat them as in the “magnitude of ones”, then if they indeed are, then we saved these observations, and otherwise we saved 1/1K, 1/1M or 1/1B of their represented values. Not too much, but still more than zero.

Preparing the relevant data

Having made this decision, we can restore the actual property- and crop-damage values and sort the events according to the total damage.

The same type of ordering is done to the human harm values as well.

As we want to display the event types according to the total harm/damage they done, we calculate these totals and sort the lists according to them.

The cumulative totals will be needed to find the events that are responsible for the top-X-percent of the harm/damage.

magnitude <- function(x) {
    ifelse(is.na(x), 1, recode(x, K=1e3, M=1e6, B=1e9, .default=1))
}

damage <- storm %>%
          filter(!is.na(PROPDMG) & !is.na(PROPDMGEXP) & !is.na(CROPDMG) & !is.na(CROPDMGEXP)) %>%
          transmute(EVTYPE,
                    PROPDMG = PROPDMG * magnitude(PROPDMGEXP),
                    CROPDMG = CROPDMG * magnitude(CROPDMGEXP)) %>%
          group_by(EVTYPE) %>%
          summarise(PROPERTY = sum(PROPDMG),
                    CROP = sum(CROPDMG)) %>%
          mutate(TOTAL_DAMAGE = PROPERTY + CROP) %>%
          arrange(desc(TOTAL_DAMAGE)) %>%
          mutate(CUMULATIVE_DAMAGE = cumsum(TOTAL_DAMAGE))

damage$EVTYPE <- factor(as.character(damage$EVTYPE), levels = damage$EVTYPE)

grand_total_damage <- max(damage$CUMULATIVE_DAMAGE)
damage <- damage %>%
          mutate(CUMULATIVE_PERCENTAGE = CUMULATIVE_DAMAGE / grand_total_damage,
                 CUMULATIVE_DAMAGE = NULL)


harm <- storm %>%
        select(EVTYPE, INJURIES, FATALITIES) %>%
        filter(!is.na(INJURIES) & !is.na(FATALITIES)) %>%
        group_by(EVTYPE) %>%
        summarise(INJURY = sum(INJURIES),
                  FATALITY = sum(FATALITIES)) %>%
        mutate(TOTAL_HARM = INJURY + FATALITY) %>%
        arrange(desc(TOTAL_HARM)) %>%
        mutate(CUMULATIVE_HARM = cumsum(TOTAL_HARM))

harm$EVTYPE <- factor(as.character(harm$EVTYPE), levels = harm$EVTYPE)

grand_total_harm <- max(harm$CUMULATIVE_HARM)
harm <- harm %>%
        mutate(CUMULATIVE_PERCENTAGE = CUMULATIVE_HARM / grand_total_harm,
               CUMULATIVE_HARM = NULL)

Results

Human harm

# helper function to display large integers in a readable form
label_for_integer <- function(x) {
    x <- x + 0.5
    if (is.na(x)) {
        ""
    } else if (x > 1e9) {
        sprintf("%.1fB", x * 1e-9)
    } else if (x > 1e6) {
        sprintf("%.1fM", x * 1e-6)
    } else if (x > 1e3) {
        sprintf("%.1fK", x * 1e-3)
    } else {
        sprintf("%d", as.integer(x))
    }
}

plot_harm <- ggplot(harm) +
             geom_col(aes(x = EVTYPE,
                          y = TOTAL_HARM,
                          fill="injury")) +
             geom_col(aes(x = EVTYPE,
                          y = FATALITY,
                          fill="fatality"),
                      position="stack") +
             scale_y_log10(breaks = function(r) { signif(10 ** seq(from = log10(r[1]),
                                                                   to = log10(r[2]),
                                                                   length.out = 10),
                                                         digits=2)},
                           labels = function(x) { sapply(x, label_for_integer) }) +
             scale_fill_manual(labels=c("injury", "fatality"), values=c("blue", "red")) +
             theme(panel.background = element_rect(fill = "white",
                                                   colour = "grey50"),
                   panel.grid.major.y = element_line(colour = "grey80"),
                   axis.text.x = element_text(angle = 90, hjust = 1)) +
             labs(x = "",
                  y = "Human harm (logarithmic)",
                  fill = "Harm type",
                  title = "Human harm by event type")
print(plot_harm)

harm[1:10, c("EVTYPE", "TOTAL_HARM", "CUMULATIVE_PERCENTAGE")]
## # A tibble: 10 x 3
##               EVTYPE TOTAL_HARM CUMULATIVE_PERCENTAGE
##               <fctr>      <int>                 <dbl>
##  1           TORNADO      65414             0.7555761
##  2    EXCESSIVE HEAT       5193             0.8155588
##  3             FLOOD       2783             0.8477043
##  4         ICE STORM       1755             0.8679757
##  5              HEAT       1493             0.8852209
##  6 HURRICANE/TYPHOON       1251             0.8996708
##  7         LIGHTNING        932             0.9104360
##  8         TSTM WIND        845             0.9201964
##  9       FLASH FLOOD        812             0.9295755
## 10          BLIZZARD        766             0.9384233

The top 3 event types account for about 85% of the harm, and the top 6 account for about 90%.

Property/crop damage

plot_damage <- ggplot(damage) +
               geom_col(aes(x = EVTYPE,
                            y = TOTAL_DAMAGE,
                            fill="property")) +
               geom_col(aes(x = EVTYPE,
                            y = CROP,
                            fill="crop"),
                        position="stack") +
               scale_y_log10(breaks = function(r) { signif(10 ** seq(from = log10(r[1]),
                                                                     to = log10(r[2]),
                                                                     length.out = 10),
                                                           digits=2)},
                           labels = function(x) { sapply(x, label_for_integer) }) +
               scale_fill_manual(labels=c("property", "crop"), values=c("blue", "red")) +
               theme(panel.background = element_rect(fill = "white",
                                                     colour = "grey50"),
                     panel.grid.major.y = element_line(colour = "grey80"),
                     axis.text.x = element_text(angle = 90, hjust = 1)) +
               labs(x = "",
                    y = "Property/crop damage (logarithmic)",
                    fill = "Damage type",
                    title = "Property/crop damage by event type")
print(plot_damage)

damage[1:10, c("EVTYPE", "TOTAL_DAMAGE", "CUMULATIVE_PERCENTAGE")]
## # A tibble: 10 x 3
##               EVTYPE TOTAL_DAMAGE CUMULATIVE_PERCENTAGE
##               <fctr>        <dbl>                 <dbl>
##  1             FLOOD 126044533500             0.6065763
##  2 HURRICANE/TYPHOON  29348117800             0.7478111
##  3         HURRICANE  10498188000             0.7983325
##  4       RIVER FLOOD  10108369000             0.8469780
##  5         ICE STORM   5108617550             0.8715627
##  6       FLASH FLOOD   4309261413             0.8923006
##  7              HAIL   3838952570             0.9107752
##  8           TORNADO   2385937163             0.9222572
##  9    HURRICANE OPAL   2187000000             0.9327820
## 10         HIGH WIND   1918571300             0.9420149

The top 4 event types account for about 85% of the damage, and the top 6 account for about 90%.