Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This report involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The objective of this report is to collect the data, preprocess (clean up) it and explore the Storm events in US over the period of 1950 to 2011, and thereby to identify the most severe events that cause the most damage to humans and property. We retrieve the data for this task from the National Weather Service Storm Data Documentation. We then look for inconsistent and erraneous data, typical of historically maintained data sources, and transform the dataset into a tidy one. From this and with the associated documentation on the data fields, we plot some exploratory plots/charts to visualy grasp the impact of these storm events on humans and property. Every step is documented along the way in a reproducible manner.

Data Retrieval and Processing

Historical storm data has been maintained by U.S. National Oceanic and Atmospheric Administration NOAA There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined DFN

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. we download the raw bzip file first, record the data downloaded, uncompress it to get the CSV file inside. We then read the CSV file into a data.frame of R language and display the number of rows and columns.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(scales)
library(stringr)
library(reshape2)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, last
fetch_datafiles_if_not_already <- function() {
    require(R.utils)

    # Lets download the zipFile and extract our inputFile off of it
    inputFile <- "repdata-data-StormData.csv"
    if (!file.exists(inputFile)) {
        bzipFile <- "repdata-data-StormData.csv.bz2"
        if (!file.exists(bzipFile)) {
            bzipUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
            tryCatch({
                message("Downloading a huge bzip file. Please wait....")
                download.file(bzipUrl, destfile=bzipFile,
                              method="curl", quiet=TRUE)

                # Display the downloaded data
                download_date <- Sys.Date()
                download_date
            }, error = function(cond) {
                stop(paste("Error dowloading bzipFile: ", bzipFile,
                           ". Reason: ", cond))
            })
        }

        tryCatch({
            bunzip2(bzipFile)  # Unzip to current working dir
        }, error = function(cond) {
            stop(paste("Error extracting bzipFile: ", bzipFile,
                       ". Reason: ", cond))
        })
    }
}


read_storm_data_efficient <- function() {
    require(data.table)

    fetch_datafiles_if_not_already()

    # dfSD must be defined in the caller's scope for efficiency
    if (is.null(SD)) {
        inputFile <- "repdata-data-StormData.csv"
        message("Loading huge data file 'StormData'. Please be patient!")
        SD <<- read.csv("repdata-data-StormData.csv", header=TRUE,
                       nrow=1240000, sep=",", comment.char="",
                       stringsAsFactors=FALSE)

        # Show the number of rows and columns of our data
        # just after datafile is read
        dim(SD)

        preprocess_data()
    }
}

Data Pre-Processing / Cleanup

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. We see that the EVTYPE column helps us differentiate various events but this is a free form text field probably human input that has various sorts of data integrity issues ranging from spelling mistakes to empty values to nonsensical values such as LIGHTNING vs LIGHNTING to ‘APACHE TERRITORY’ to " " (empty strings). This results in too many classifications (~840) whereas the associated documentation suggestes only around ~50 classifications. We perform a sequence of the cleaning operations to consolidate(reduce) this variable from ~840 values to ~45 values.

# Crop damage and Property Damage columns have another column associated
# them that give the exponent (10 to the power) for the value
# We need to combine those pairs of columsn
exponent_code_to_value <- function(e) {
    ifelse(e=='H', 100,
               ifelse(e=='T' | e == 'K', 1000, 
                      ifelse(e=='M', 10^6, 
                             ifelse(e=='B', 10^9, 1))))
}


preprocess_data <- function() {
    # Take only these columns and throw away the rest
    # BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
    # PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
    requiredCols <- c(2, 7, 8, 23:28)
    SD <<- SD[,requiredCols]

    # Consistent lower-case variable(column) names
    names(SD) <<- tolower(names(SD))

    # The EVTYPE variable has many human error characteristics
    # typical of free form text input. Lets apply a series of
    # transformations to that column
    SD$evtype.orig <<- SD$evtype

    SD$evtype <<- sanitize_event_type(SD$evtype)

    # Crop damage and Property Damage columns have another column associated
    # them that give the exponent (10 to the power) for the value
    # We need to combine those pairs of columsn
    ok <- complete.cases(SD$cropdmg, SD$cropdmgexp)
    SD$cropdmg[ok] <<- SD$cropdmg[ok] * exponent_code_to_value(toupper(SD$cropdmgexp[ok]))
    ok <- complete.cases(SD$propdmg, SD$propdmgexp)
    SD$propdmg[ok] <<- SD$propdmg[ok] * exponent_code_to_value(toupper(SD$propdmgexp[ok]))
    
    #Just after preprocessing, display the number of rows & cols, col-names
    dim(SD)
    names(SD)
}

The pre-processing step involves reducing the multiple storm type string values into few manageable and consistent values.

  • First we convert the evtype values to a consistent uppercase and trim the leading and trailing spaced. So a value " Heavy Wind / High Surf " becomes “HEAVY WIND / HIGH SURF”

  • Then we replace each non-word character (that is not an alphabet or digit or underscore) with a single space and a sequence of spaces with a single space as well.. “HEAVY WIND / HIGH SURF” becomes “HEAVY WIND HIGH SURF”

  • Remove observations with human data entry errors. We can’t classify them anyway.

  • Then apply a series of regular expression match and replacements to reduce the ~840 possibilities into ~40 possibilities.

sanitize_event_type <- function(evtype) {
    # First, upper-case, trim leading and training whitespace
    evtype <- str_trim(toupper(evtype))

    # Replace non-word chars with a single space. Multiple spaces into one space
    evtype <- gsub("\\W+", " ", evtype)
    evtype <- gsub("\\s+", " ", evtype)

    # Human errors; can't classify them
    evtype <- gsub("^(SUMMARY|MONTHLY|NO |NONE|SEI|APACHE|SOUTH|\\s+).*",
                   NA, evtype) 

    # Apply these pattern replacements to reduce the ~900 unique possibilities
    # into a manageable ~40 possible classifications.
    pat <- c(
        #----------------------------
        #"pattern","replacement"
        #----------------------------
        "^RECORD\\s*HIGH", "HIGH",
        "^RECORD\\s*LOW", "LOW",
        "^(RECORD|EXCESSIVE)", "HIGH",
        "^AVALAN.*", "Avalanche",
        "^BEACH ERO.*", "Beach Erosion",
        ".*BLIZZAR.*", "Blizzard",
        "^BLOWING SNOW.*", "Blowing Snow",
        "^COASTAL.*FLO.*", "Coastal Flood",
        "^COASTAL.*STO.*", "Coastal Storm",
        ".*(COLD|ICE|ICY|WET|WINT|FREEZ|CHILL|COOL).*", "Cold",
        ".*(THUNDER.*ST|TSTM).*", "Thunderstorm",
        "^DUST.*(DEV|STO).*", "Dust Storm",
        "^EXTRE.*COLD.*", "Extream Cold",
        ".*(FLASH.*FLOO|URB.*STR).*", "Flash Flood",
        ".*(FLOOD|SMALL\\s*STR).*", "Flood",
        "^URBAN.*FLOOD.*", "Flood",
        "^FREEZ.*RAIN.*", "Freezing Rain",
        "^HAIL.*", "Hail Storm",
        "^HEAVY.*(RAIN|SHOWER).*", "Heavy Rain",
        "^HEAVY.*SNOW.*", "Heavy Snow",
        "^HIGH.*WIND.*", "High Wind",
        "^HIGH.*SURF.*", "High Surf",
        "^HURRICANE.*", "Hurrycane",
        "^HYP(O|ER)TH.*", "Hypothermia",
        "^(ICE|ICY).*STOR.*", "Icy Storm",
        "^LIGHTNING.*", "Lightning",
        "^LIGHT.*SNOW.*", "Light Snow",
        ".*(LAND|MUD).*(SLIDE|SLUMP).*", "Land Slide",
        "^RAIN.*", "Rain",
        "^SNOW.*", "Snow",
        "^TIDAL.*FLOO.*", "Tidal Flood",
        ".*(FLOOD|FLD).*", "Flood",
        "^TORN(ADO|DAO).*", "Tornado",
        "^TROPICAL.*STORM.*", "Tropical Storm",
        "^VOLCAN.*ERUP.*", "Volcanic Eruption",
        "^VOLCAN.*ASH.*", "Volcanic Ash",
        "^WATER.*SPOU.*", "Water Spout",
        ".*WIND CHILL.*", "Wind Chill",
        "^WIND.*", "Wind",
        "^WINTER.*STOR.*", "Winter Storm",
        ".*UNSEASON.*", "Unseasonable Weather",
        ".*(SWELL|WAVE|WATER|SURF|COASTAL|TIDE|SEA|MARINE).*", "Heavy Tide",
        ".*(WIND|WND|GUST|DUST).*", "Wind",
        ".*SNOW.*", "Heavy Snow",
        ".*(FIRE|SMOKE).*", "Fire",
        ".*(DRY|WARM|HEAT|HOT|DRIEST|DROUGHT).*", "Heat",
        ".*FOG.*", "Fog",
        ".*(FUNNEL|CLOUD).*", "Cloud",
        ".*(RAIN|PRECIP).*", "Heavy Rain",
        ".*HAIL.*", "Hail",
        ".*LIG.*ING.*", "Lightning",
        ".*HIGH.*TEMP.*", "Heat",
        ".*(FROST|LOW.*TEMP).*", "Cold",
        ".*SLIDE.*", "Landslide",
        ".*TSUNA.*", "Tsunami",
        ".*TYPHOO.*", "Typhoon",
        ".*SPOUT.*", "Water Spout",
        ".*DROWN.*", "Drowning",
        ".*URBAN.*SMALL.*", "Flash Flood",
        ".*DAM\\s+.*", "Dam Failure",
        "^[A-Z][A-Z]+.*", NA  # Ignore evertying else (noise)
    )
    for (r in seq(1, length(pat), 2)) {
        pattern <- pat[r]; replacement <- pat[r+1];
        evtype <- gsub(pattern, replacement, x=evtype)
    }
    evtype
}

Now, we read the data and make the first level of counts per each event type total fatalities, total injuries, total property damages and total crop damages.

SD <- NULL
read_storm_data_efficient()
## Loading huge data file 'StormData'. Please be patient!
##  [1] "bgn_date"    "state"       "evtype"      "fatalities"  "injuries"   
##  [6] "propdmg"     "propdmgexp"  "cropdmg"     "cropdmgexp"  "evtype.orig"
# Get totals of fatality, injury, cropdmg, propdmg, per each unique event
df <- SD %>%
        group_by(evtype) %>%
        summarise(tot_fatal = sum(fatalities), 
                  tot_injur = sum(injuries),
                  tot_propdmg = sum(propdmg),
                  tot_cropdmg = sum(cropdmg))
df <- df %>% filter(complete.cases(df))  # Remove invalid data entries

Question #1: Which events are the most harmful to Public Health/Life ?

    dfh <- df %>%
            filter(tot_fatal > 0 | tot_injur > 0) %>%
            arrange(tot_fatal)
    y_scale <- seq_along(1:NROW(dfh))
    dfh <- dfh %>% mutate(evtype_f = factor(y_scale, levels=y_scale, labels=evtype))

    dfh # This is the final contents of our Tidy data, just before plotting
## Source: local data frame [36 x 6]
## 
##                  evtype tot_fatal tot_injur  tot_propdmg tot_cropdmg
## 1                 Cloud         0         3       194600           0
## 2                  Hail         0        10        70000    20793000
## 3               Typhoon         0         5    600230000      825000
## 4              Drowning         1         0            0           0
## 5            Light Snow         1         2      2598000           0
## 6          Blowing Snow         2        14        15000           0
## 7         Coastal Storm         4         2        50000           0
## 8                  Rain         5         2      5350050      750000
## 9         Coastal Flood         6         7    427566060       56000
## 10          Water Spout         6        72     60737200           0
## 11                 Snow         8       102     18347550       10000
## 12          Hypothermia         9         0            0           0
## 13           Hail Storm        15      1361  15974470043  3026094623
## 14           Dust Storm        24       483      6318130     3600000
## 15              Tsunami        33       129    144062000       20000
## 16 Unseasonable Weather        40        17            0    10010000
## 17           Land Slide        44        55    327246100    20017000
## 18       Tropical Storm        66       383   7714390550   694896000
## 19                  Fog        80      1076     22829500           0
## 20                 Fire        90      1608   8496728500   403281630
## 21             Blizzard       101       806    664913950   112060000
## 22           Heavy Rain       102       308   3226834140   795402800
## 23            High Surf       104       156     90155000           0
## 24           Heavy Snow       130      1024    983879152   134653100
## 25            Hurrycane       135      1328  84756180010  5515292800
## 26                 Wind       143       408    191630290    71030050
## 27            Avalanche       225       170      3721800           0
## 28            High Wind       293      1467   5887806043   679301900
## 29           Heavy Tide       299       534   4676669040     6450000
## 30                Flood       483      6795 150189663635 10842825950
## 31         Thunderstorm       754      9544  12576128656  1274208988
## 32            Lightning       817      5231    933732447    12092090
## 33                 Cold       874      4531  11036484311  8652384850
## 34          Flash Flood      1064      1879  16964543937  1540685250
## 35                 Heat      2960      8864   1062504300 14871450280
## 36              Tornado      5633     91364  56941932479   414961470
## Variables not shown: evtype_f (fctr)
    # Melt a 3 column dataset into a 2 column dataset for easy 
    # two series-of-points plotting in the same plot
    df2 <- melt(dfh[,c("evtype_f", "tot_fatal", "tot_injur")], id="evtype_f")
    df2$variable <- gsub("tot_fatal", "Fatalities", df2$variable)
    df2$variable <- gsub("tot_injur", "Injuries", df2$variable)

    ggplot(df2, aes(y=evtype_f, x=log(value), colour=variable, size=value)) + 
      geom_point(shape=20) + 
      scale_size_continuous(range = c(4,15)) +
      ylab("Event Type") +
      xlab("Health Impact (Fatalities/Injuries) in log scale") +
      ggtitle("Storm Events impact on US Population Health for 1950-2011")

Result #1

The plot shows that over the course of years 1950 to 2011, Tornados, Thunderstorms, Heat, Hail Storms appear to be causing most Human fatalities/injuries for the years 1950-2011“. Tornado is the most damaging event with ~5600 deaths and ~91400 injuries.

  # Top most 5 severe events 
  head(arrange(dfh, -tot_fatal)[,c(1,2,3)], 5)
## Source: local data frame [5 x 3]
## 
##        evtype tot_fatal tot_injur
## 1     Tornado      5633     91364
## 2        Heat      2960      8864
## 3 Flash Flood      1064      1879
## 4        Cold       874      4531
## 5   Lightning       817      5231

Question #2: Which events cause the most damage to US Properties/Agriculture ?

    dfp <- df %>%
            filter(complete.cases(evtype, tot_propdmg, tot_cropdmg) & 
                       tot_propdmg > 0 | tot_cropdmg > 0) %>%
            arrange(tot_propdmg)
    y_scale <- seq_along(1:NROW(dfp))
    dfp <- dfp %>% mutate(evtype_f = factor(y_scale, levels=y_scale, labels=evtype))

    dfp # This is the final contents of our Tidy data, just before plotting
## Source: local data frame [38 x 6]
## 
##                  evtype tot_fatal tot_injur  tot_propdmg tot_cropdmg
## 1  Unseasonable Weather        40        17            0    10010000
## 2          Blowing Snow         2        14        15000           0
## 3         Coastal Storm         4         2        50000           0
## 4                  Hail         0        10        70000    20793000
## 5         Beach Erosion         0         0       100000           0
## 6             Landslide         0         0       150000           0
## 7                 Cloud         0         3       194600           0
## 8          Volcanic Ash         0         0       500000           0
## 9           Dam Failure         0         0      1002000           0
## 10           Light Snow         1         2      2598000           0
## 11            Avalanche       225       170      3721800           0
## 12                 Rain         5         2      5350050      750000
## 13           Dust Storm        24       483      6318130     3600000
## 14                 Snow         8       102     18347550       10000
## 15                  Fog        80      1076     22829500           0
## 16          Water Spout         6        72     60737200           0
## 17            High Surf       104       156     90155000           0
## 18              Tsunami        33       129    144062000       20000
## 19                 Wind       143       408    191630290    71030050
## 20           Land Slide        44        55    327246100    20017000
## 21        Coastal Flood         6         7    427566060       56000
## 22              Typhoon         0         5    600230000      825000
## 23             Blizzard       101       806    664913950   112060000
## 24            Lightning       817      5231    933732447    12092090
## 25           Heavy Snow       130      1024    983879152   134653100
## 26                 Heat      2960      8864   1062504300 14871450280
## 27           Heavy Rain       102       308   3226834140   795402800
## 28           Heavy Tide       299       534   4676669040     6450000
## 29            High Wind       293      1467   5887806043   679301900
## 30       Tropical Storm        66       383   7714390550   694896000
## 31                 Fire        90      1608   8496728500   403281630
## 32                 Cold       874      4531  11036484311  8652384850
## 33         Thunderstorm       754      9544  12576128656  1274208988
## 34           Hail Storm        15      1361  15974470043  3026094623
## 35          Flash Flood      1064      1879  16964543937  1540685250
## 36              Tornado      5633     91364  56941932479   414961470
## 37            Hurrycane       135      1328  84756180010  5515292800
## 38                Flood       483      6795 150189663635 10842825950
## Variables not shown: evtype_f (fctr)
    # Melt a 3 column dataset into a 2 column dataset for easy 
    # two series-of-points plotting in the same plot
    df2 <- melt(dfp[,c("evtype_f", "tot_cropdmg", "tot_propdmg")], id="evtype_f")
    df2$variable <- gsub("tot_cropdmg", "Crops", df2$variable)
    df2$variable <- gsub("tot_propdmg", "Properties", df2$variable)

    ggplot(df2, aes(y=evtype_f, x=log(value), colour=variable, size=value)) + 
      geom_point(shape=20) + 
      scale_size_continuous(range = c(4,15), labels=comma) +
      ylab("Event Type") +
      xlab("Financial Impact (Crops/Property damages) in log scale") +
      ggtitle("Storm events impact on US Properties/Crops for 1950-2011")

Results #2

The plot shows that over the course of years 1950 to 2011, Flood, Hurricane, Tornados and Floods are severe events that have caused the most damage to the Crops and Properties in United States. Flood is the most damaging event with 150 Billion USD in Property damage and 11 Billion USD in Crop damage.

  # Top most 5 severe events 
  head(arrange(dfp, -tot_propdmg)[,c(1,4,5)], 5)
## Source: local data frame [5 x 3]
## 
##        evtype  tot_propdmg tot_cropdmg
## 1       Flood 150189663635 10842825950
## 2   Hurrycane  84756180010  5515292800
## 3     Tornado  56941932479   414961470
## 4 Flash Flood  16964543937  1540685250
## 5  Hail Storm  15974470043  3026094623