Synopsis

This report analyzes the NOAA storm database for the years 1950 to 2011 and answers the following questions:

  1. Across the United States, which types of events are most harmful to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

We found that tornadoes have by far the highest impact on population health in terms of fatalities and injuries, followed by thunderstorm wind and excessive heat. We also found that floods cause the greatest economic damage, followed by hurricanes/typhoons and tornadoes.

Data Processing

From the course website, we obtained the NOAA storm database in the form of a comma-separated-value file compressed with the bzip2 algorithm. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Read the raw data

We first read the CSV file directly from the bzip2 archive. Missing values are blank in the database, and there is a header line.

        filename <- "repdata-data-StormData.csv.bz2"
        rawdata <- read.csv(filename,
                         header = TRUE,
                         sep = ",",
                         na.strings= "")
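The effect of `na.strings = ""` can be checked on a small scale before loading the full file. The sketch below is only an illustration: the toy data frame, its values, and the temporary file are assumptions, not part of the storm database. It writes a tiny bzip2-compressed CSV and reads it back with the same arguments used above.

```r
# Hypothetical miniature of the storm file, written to a temporary bzip2 archive.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
                  PROPDMG    = c(25, 2.5, 10),
                  PROPDMGEXP = c("K", NA, "M"),
                  stringsAsFactors = FALSE)

con <- bzfile(tmp, open = "w")
write.csv(toy, file = con, row.names = FALSE, na = "")  # NA is written as a blank field
close(con)

# read.csv decompresses the archive transparently; blank fields come back as NA.
toy_read <- read.csv(tmp,
                     header = TRUE,
                     sep = ",",
                     na.strings = "",
                     stringsAsFactors = FALSE)
toy_read
```

The second row of `PROPDMGEXP` is read back as `NA`, which is the behaviour the later clean-up steps rely on.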

This is a big dataset, so we check its structure.

        str(rawdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

In this report we are only interested in an analysis across the United States, and only in population health and economic consequences. Hence all location columns, and any other columns not directly relevant to these questions, can be removed.

First, look at all the columns available.

        names(rawdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Now select only the columns that are relevant to our report. These are:

  - EVTYPE: The weather event
  - FATALITIES: Fatalities due to EVTYPE
  - INJURIES: Injuries due to EVTYPE
  - PROPDMG: Property damage value
  - PROPDMGEXP: The unit of property damage
  - CROPDMG: Crop damage value
  - CROPDMGEXP: The unit of crop damage
  - REFNUM: Reference number, retained in case we need to refer back to the raw data

        tidydata <- rawdata[ , c("EVTYPE", 
                                 "FATALITIES", 
                                 "INJURIES", 
                                 "PROPDMG", 
                                 "PROPDMGEXP", 
                                 "CROPDMG", 
                                 "CROPDMGEXP", 
                                 "REFNUM")]        

Free up memory by removing rawdata. As a nice bonus, the data size is reduced from 500 MB of raw data to 50 MB of tidy data.

        rm(rawdata)

Let us look at the structure of this dataset.

        str(tidydata)
## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We need to clean up PROPDMGEXP, as it has 18 factor levels. First, take a look at the unique values.

        unique(tidydata$PROPDMGEXP)
##  [1] K    M    <NA> B    m    +    0    5    6    ?    4    2    3    h   
## [15] 7    H    -    1    8   
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

The column has 18 levels, plus missing values.

For the purpose of this analysis, we assume that only the following codes are valid units; every other code is treated as a multiplier of 0. Since the question is to find the greatest economic damage, this appears to be a reasonable assumption.

  - H: Hundreds
  - K: Thousands
  - M: Millions
  - B: Billions
  - h: assumed to be the same as ‘H’, i.e. hundreds
  - k: assumed to be the same as ‘K’, i.e. thousands
  - m: assumed to be the same as ‘M’, i.e. millions
  - b: assumed to be the same as ‘B’, i.e. billions

        exp <- c("H", "h", "K", "k", "M", "m", "B", "b", "0")
        value <- c(100L,        100L, 
                   1000L,       1000L, 
                   1000000L,    1000000L, 
                   1000000000L, 1000000000L,
                   0L)
        df <- data.frame(exp, value)

Treatment of NA and other values not properly coded

While assessing the greatest damage, we have to add the property damage value and the crop damage value, as there could be one without the other. If a unit is NA, the corresponding product is NA and so is the total for that row. Hence all NAs and improperly coded values are treated as 0 for the purpose of this analysis. Their effect should be marginal in any case, because the data contains damage figures in the billions.

        tidydata_eco <- tidydata

        tidydata_eco[is.na(tidydata_eco$PROPDMGEXP), ]$PROPDMGEXP <- "0"

        tidydata_eco[!(tidydata_eco$PROPDMGEXP %in% exp), ]$PROPDMGEXP <- "0"
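The row-subsetting form used above relies on "0" already being a level of the factor; assigning a value that is not an existing level produces NA with a warning. A minimal sketch of a more defensive variant, assuming it is acceptable to work on character values, is shown below. The names `valid_exp`, `unit` and `unit_chr` and the toy values are illustrative only.

```r
# Toy column mimicking PROPDMGEXP; the values are illustrative only.
valid_exp <- c("H", "h", "K", "k", "M", "m", "B", "b", "0")
unit <- factor(c("K", NA, "+", "m", "5", "B"))

# Work on character values so that assigning "0" never depends on factor levels.
unit_chr <- as.character(unit)

# Missing or unrecognised codes are treated as a multiplier of zero.
unit_chr[is.na(unit_chr) | !(unit_chr %in% valid_exp)] <- "0"
unit_chr
```

Indexing the vector directly (`x[condition] <- value`) is also a no-op when no element matches, so the same line can be reused for columns that happen to be clean already.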

Similarly, we need to clean up CROPDMGEXP, as it has 8 factor levels. First, take a look at the unique values.

        unique(tidydata$CROPDMGEXP)
## [1] <NA> M    K    m    B    ?    0    k    2   
## Levels: ? 0 2 B k K m M

We apply the same NA and “exp” treatment as for PROPDMGEXP.

        tidydata_eco[is.na(tidydata_eco$CROPDMGEXP), ]$CROPDMGEXP <- "0"

        tidydata_eco[!(tidydata_eco$CROPDMGEXP %in% exp), ]$CROPDMGEXP <- "0"

Now let us calculate the economic damage. This is the property damage value multiplied by its unit, plus the crop damage value multiplied by its unit.

     tidydata_eco$ECODMG <- 
        value[match(tidydata_eco$PROPDMGEXP, df$exp)]*tidydata_eco$PROPDMG  + 
        value[match(tidydata_eco$CROPDMGEXP, df$exp)]*tidydata_eco$CROPDMG
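As a worked illustration of this arithmetic, the sketch below applies the same multipliers to three made-up rows. It uses a named vector (here called `multiplier`, a hypothetical name) instead of the `match()` lookup; the toy values are assumptions chosen so the totals are easy to verify by hand.

```r
# Unit codes mapped to multipliers, mirroring the exp/value table defined earlier.
multiplier <- c(H = 1e2, h = 1e2, K = 1e3, k = 1e3,
                M = 1e6, m = 1e6, B = 1e9, b = 1e9, "0" = 0)

# Three illustrative rows (not taken from the storm database).
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2),
                  PROPDMGEXP = c("K", "M", "0"),
                  CROPDMG    = c(0, 10, 5),
                  CROPDMGEXP = c("0", "K", "B"),
                  stringsAsFactors = FALSE)

# Property damage times its unit, plus crop damage times its unit.
toy$ECODMG <- unname(multiplier[toy$PROPDMGEXP]) * toy$PROPDMG +
              unname(multiplier[toy$CROPDMGEXP]) * toy$CROPDMG
toy$ECODMG
```

The first row gives 25 x 1,000 + 0 = 25,000; the second gives 2.5 x 1,000,000 + 10 x 1,000 = 2,510,000; in the third the property figure is dropped because its unit is "0", leaving 5 x 1,000,000,000.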

We also calculate the impact on population health. For the purpose of the analysis, we assume that both fatalities and injuries affect population health and that both carry equal weight.

     tidydata_eco$HLTHDMG <- tidydata_eco$FATALITIES + 
                                         tidydata_eco$INJURIES

So far we have cleaned up the columns. Now we can clean up the rows.

Let us first eliminate those rows that have 0 economic damage and 0 health damage

        tidydata_eco <- tidydata_eco[!(tidydata_eco$ECODMG == 0 & 
                                               tidydata_eco$HLTHDMG == 0), ]

Then convert the event types to upper case and trim the surrounding blank spaces.

        library(stringr)
        tidydata_eco$EVTYPE <- str_trim(toupper(tidydata_eco$EVTYPE))

Now we have 254331 observations of 10 variables, and EVTYPE has 440 distinct values.

Let us clean up EVTYPE further by mapping its values to the valid event types given in section 2.1.1 of the code book.

## TSTM is assumed to be Thunderstorm
tidydata_eco[grepl("^TSTM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"

## Thunderstorm variants are consolidated under the same label
tidydata_eco[grepl("^THUNDERSTORM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"

## non TSTM winds are classified as HIGH WIND
tidydata_eco[grepl("^NON.TSTM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HIGH WIND"


tidydata_eco[grepl("^HIGH WIND", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HIGH WIND"


## Hail is cleaned up 
tidydata_eco[grepl("^HAIL", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HAIL"


## Hurricane names are removed
tidydata_eco[grepl("HURRICANE", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HURRICANE/TYPHOON"

## If it begins with Waterspout, it is given priority. There are a few events
## with both Waterspout and Tornado
tidydata_eco[grepl("^WATERSPOUT", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "WATERSPOUT"

## Event that has Tornado is given priority over others
tidydata_eco[grepl("TORNADO", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "TORNADO"

tidydata_eco[grepl("^FLASH FLOOD", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLASH FLOOD"

tidydata_eco[grepl("ICE STORM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "ICE STORM"

## If there is flash anywhere, it is flash flood
tidydata_eco[grepl("FLASH", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLASH FLOOD"

## Now if it begins with flood, make it flood
tidydata_eco[grepl("^FLOOD", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLOOD"

## Coastal flood
tidydata_eco[grepl("COAST", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "COASTAL FLOOD"


## If it does not start with EXTREME COLD but contains COLD anywhere, it is marked as COLD
tidydata_eco[grepl("(^(?!EXTREME COLD))(?=.*COLD)", 
                   tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "COLD"


tidydata_eco[grepl("FROST|FREEZE", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FROST/FREEZE"


tidydata_eco[grepl("(^(?!FREEZING FOG))(?=.*FREEZING)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FROST/FREEZE"


tidydata_eco[grepl("(^(?!EXCESSIVE HEAT))(^(?!RECORD))(^(?!DROUGHT))(?=.*HEAT)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAT"
             
tidydata_eco[grepl("(LAKE)(?=.*SNOW)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "LAKE-EFFECT SNOW"
            
tidydata_eco[grepl("(^(?!LAKE-EFFECT))(?=.*SNOW)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAVY SNOW"

tidydata_eco[grepl("RAIN", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAVY RAIN"

tidydata_eco[grepl("SURF", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAVY SURF"

tidydata_eco[grepl("(^THU|TUN)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"

tidydata_eco[grepl("(^(?!MARINE))(?=.*THUNDER)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"

tidydata_eco[grepl("(^(?!THUNDER|MARINE|DUST|EXTREME))(?=.*WIND)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HIGH WIND"

tidydata_eco[grepl("LAKE ", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "LAKESHORE FLOOD"

tidydata_eco[grepl("CSTL", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "COASTAL FLOOD"

tidydata_eco[grepl("(^(?!COASTAL|FLASH|LAKE))(?=.*FLOOD)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLOOD"

tidydata_eco[grepl("(^ICE|ICY)", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "ICE STORM"

tidydata_eco[grepl("LIG", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "LIGHTNING"

tidydata_eco[grepl("RECORD", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "EXCESSIVE HEAT"

tidydata_eco[grepl("RIP CURRENT", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "RIP CURRENT"

tidydata_eco[grepl("TROPICAL STORM", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "TROPICAL STORM"

tidydata_eco[grepl("^WINTER WEATHER", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "WINTER WEATHER"

tidydata_eco[grepl("AVALA", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "AVALANCHE"

tidydata_eco[grepl("BLIZZARD", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "BLIZZARD"

tidydata_eco[grepl("DROUGHT", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "DROUGHT"

tidydata_eco[grepl("TORN", 
             tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "TORNADO"
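Because each rule above rewrites EVTYPE in place, the order of the rules matters: a later pattern sees the output of the earlier ones. A minimal sketch of the same idea in table-driven form is shown below. The `rules` table and the `recode_events()` helper are hypothetical names introduced for illustration, and only three of the patterns are reproduced.

```r
# Ordered rules: each pattern is applied in turn, so earlier rewrites feed later ones.
rules <- data.frame(
  pattern = c("^TSTM",
              "TORNADO",
              "(^(?!EXTREME COLD))(?=.*COLD)"),  # contains COLD, but does not start with EXTREME COLD
  label   = c("THUNDERSTORM WIND", "TORNADO", "COLD"),
  stringsAsFactors = FALSE
)

recode_events <- function(x, rules) {
  for (i in seq_len(nrow(rules))) {
    hit <- grepl(rules$pattern[i], x, perl = TRUE)
    x[hit] <- rules$label[i]  # does nothing when no element matches
  }
  x
}

recoded <- recode_events(
  c("TSTM WIND/HAIL", "TORNADO F2", "COLD WAVE", "EXTREME COLD", "HAIL"),
  rules
)
recoded
```

The negative lookahead `(?!EXTREME COLD)` is why "EXTREME COLD" is left untouched while "COLD WAVE" becomes "COLD". Keeping the patterns in one table also makes the order of application explicit.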

There are now 116 unique event types.

pending_events <- sort(unique(tidydata_eco$EVTYPE))

The master set has 48 event types plus 1 catch-all (OTHER). The list is taken from section 2.1.1 of the code book PDF.

valid_event <- 
        c("Astronomical Low Tide",
                "Avalanche",
                "Blizzard",
                "Coastal Flood",
                "Cold/Wind Chill",
                "Debris Flow",
                "Dense Fog",
                "Dense Smoke",
                "Drought",
                "Dust Devil",
                "Dust Storm",
                "Excessive Heat",
                "Extreme Cold/Wind Chill",
                "Flash Flood",
                "Flood",
                "Frost/Freeze",
                "Funnel Cloud",
                "Freezing Fog",
                "Hail",
                "Heat",
                "Heavy Rain",
                "Heavy Snow",
                "High Surf",
                "High Wind",
                "Hurricane/Typhoon",
                "Ice Storm",
                "Lake-Effect Snow",
                "Lakeshore Flood",
                "Lightning",
                "Marine Hail",
                "Marine High Wind",
                "Marine Strong Wind",
                "Marine Thunderstorm Wind",
                "Rip Current",
                "Seiche",
                "Sleet",
                "Storm Surge/Tide",
                "Strong Wind",
                "Thunderstorm Wind",
                "Tornado",
                "Tropical Depression",
                "Tropical Storm",
                "Tsunami",
                "Volcanic Ash",
                "Waterspout",
                "Wildfire",
                "Winter Storm",
                "Winter Weather")

        valid_event <- toupper(valid_event)

Now match the cleaned event types against the valid events. Where there is no match, fill in “OTHER”. Then create a new column in the dataset to hold the matched values. This way both the original events and their matches are available for future reference.

        event_match <- valid_event[pmatch(pending_events, valid_event, 
                                        duplicates.ok = TRUE )]

        event_match <- ifelse(is.na(event_match), "OTHER", event_match)

        tidydata_eco$EVTYPE1 <- 
                event_match[match(tidydata_eco$EVTYPE, pending_events)]
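The behaviour of `pmatch()` is worth seeing on a few labels, since it accepts an exact match or a unique prefix match and returns NA otherwise. The sketch below uses a shortened, illustrative subset of the valid events (the vectors `valid` and `pending` are assumptions for the example, not the full lists).

```r
# Shortened list of valid events and a few cleaned labels to look up.
valid   <- c("COLD/WIND CHILL", "FLASH FLOOD", "FLOOD",
             "HIGH SURF", "MARINE HAIL", "MARINE HIGH WIND")
pending <- c("COLD", "FLOOD", "MARINE", "HEAVY SURF")

hit <- pmatch(pending, valid, duplicates.ok = TRUE)

# Labels with no exact or unique prefix match fall into the catch-all category.
matched <- ifelse(is.na(hit), "OTHER", valid[hit])
data.frame(pending, matched)
```

"COLD" is a unique prefix of "COLD/WIND CHILL" and "FLOOD" is an exact match, so both resolve. "MARINE" is a prefix of two valid events and is therefore ambiguous, and "HEAVY SURF" is not a prefix of any valid event, so both end up as "OTHER".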

Finally we have a clean data set. We will take a look at the first 6 rows.

        head(tidydata_eco)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REFNUM
## 1 TORNADO          0       15    25.0          K       0          0      1
## 2 TORNADO          0        0     2.5          K       0          0      2
## 3 TORNADO          0        2    25.0          K       0          0      3
## 4 TORNADO          0        2     2.5          K       0          0      4
## 5 TORNADO          0        2     2.5          K       0          0      5
## 6 TORNADO          0        6     2.5          K       0          0      6
##   ECODMG HLTHDMG EVTYPE1
## 1  25000      15 TORNADO
## 2   2500       0 TORNADO
## 3  25000       2 TORNADO
## 4   2500       2 TORNADO
## 5   2500       2 TORNADO
## 6   2500       6 TORNADO

We will also look at the last 6 rows, as data collection is much better towards the end.

        tail(tidydata_eco)
##              EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 902249 WINTER STORM          0        0     2.0          K       0
## 902250 WINTER STORM          0        0     5.0          K       0
## 902255    HIGH WIND          0        0     0.6          K       0
## 902257    HIGH WIND          0        0     1.0          K       0
## 902259      DROUGHT          0        0     2.0          K       0
## 902260    HIGH WIND          0        0     7.5          K       0
##        CROPDMGEXP REFNUM ECODMG HLTHDMG      EVTYPE1
## 902249          K 902249   2000       0 WINTER STORM
## 902250          K 902250   5000       0 WINTER STORM
## 902255          K 902255    600       0    HIGH WIND
## 902257          K 902257   1000       0    HIGH WIND
## 902259          K 902259   2000       0      DROUGHT
## 902260          K 902260   7500       0    HIGH WIND

Now we calculate the summary of the data for population health and economic damage.

        library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
        event_gp <- group_by(tidydata_eco, EVTYPE1)

        event_smry <- summarize(event_gp, 
                        hlthsum = sum(HLTHDMG), 
                        hlthmn = mean(HLTHDMG),
                        hlthmx = max(HLTHDMG),
                        ecosum = sum(ECODMG),
                        ecomn = mean(ECODMG),
                        ecomx = max(ECODMG))
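The grouped totals can be cross-checked without dplyr. The following is a minimal base R sketch on a made-up data frame (the five rows are assumptions, not storm data) that reproduces the per-event sums with `tapply()`.

```r
# Five illustrative rows with the same column names as tidydata_eco.
toy <- data.frame(EVTYPE1 = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
                  HLTHDMG = c(15, 0, 2, 3, 0),
                  ECODMG  = c(25000, 1e6, 2500, 5e5, 100),
                  stringsAsFactors = FALSE)

# Per-event totals, equivalent to the hlthsum and ecosum columns of event_smry.
hlthsum <- tapply(toy$HLTHDMG, toy$EVTYPE1, sum)
ecosum  <- tapply(toy$ECODMG,  toy$EVTYPE1, sum)

# Rank events by total health impact, largest first.
ranked <- sort(hlthsum, decreasing = TRUE)
ranked
```

In this toy data TORNADO has the largest health total (15 + 2 = 17) while FLOOD has the largest economic total (1,000,000 + 500,000), which mirrors the kind of split seen in the results.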

Results

Impact to population health

Now let us look at the top 10 weather events that have the greatest impact on population health across the US, summed over all years.

        arrange(event_smry, desc(hlthsum))[1:10, c("EVTYPE1", "hlthsum")]
## Source: local data frame [10 x 2]
## 
##              EVTYPE1 hlthsum
## 1            TORNADO   97022
## 2  THUNDERSTORM WIND   10220
## 3     EXCESSIVE HEAT    8497
## 4              FLOOD    7279
## 5          LIGHTNING    6048
## 6               HEAT    3863
## 7        FLASH FLOOD    2835
## 8              OTHER    2725
## 9          HIGH WIND    2331
## 10         ICE STORM    2262

We will also plot it for easy visualization. Note that the y axis shows the natural logarithm of the totals for ease of interpretation.

        library(ggplot2)

        qplot(EVTYPE1, 
              log(hlthsum), 
              data= arrange(event_smry, desc(hlthsum))[1:10, ],
              xlab = "Top 10 Weather Events",
              ylab = "Log of sum of fatalities and injuries") +
        labs(title = "Figure 1: Consequences to population health") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))
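Since `log()` in R is the natural logarithm, a vertical gap on this plot corresponds to a ratio between totals rather than a difference. A short worked example, using the two largest totals from the table above, shows how to convert a gap back into a ratio.

```r
# Totals for the two highest-ranked events, copied from the table above.
hlthsum <- c(TORNADO = 97022, "THUNDERSTORM WIND" = 10220)

# Gap between the two points on the natural-log axis.
log_gap <- log(hlthsum[["TORNADO"]]) - log(hlthsum[["THUNDERSTORM WIND"]])

# Exponentiating the gap recovers the ratio of the original totals.
ratio <- exp(log_gap)
c(log_gap = log_gap, ratio = ratio)
```

A gap of about 2.25 on the log axis therefore means the tornado total is roughly 9.5 times the thunderstorm wind total.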

By far the highest impact on population health is due to tornadoes, followed by thunderstorm wind and excessive heat.

Note that the catch-all category OTHER ranks eighth, which suggests that the event-type matching could be tuned further.

Economic Impact

Now let us look at the top 10 weather events that cause the greatest economic damage across the US, summed over all years.

        arrange(event_smry,desc(ecosum))[1:10, c("EVTYPE1", "ecosum")]
## Source: local data frame [10 x 2]
## 
##              EVTYPE1       ecosum
## 1              FLOOD 161000844600
## 2  HURRICANE/TYPHOON  90271472810
## 3            TORNADO  58959393590
## 4   STORM SURGE/TIDE  47965579000
## 5               HAIL  19000564320
## 6        FLASH FLOOD  18440124760
## 7            DROUGHT  15018677780
## 8  THUNDERSTORM WIND  12242125730
## 9          ICE STORM   8981254510
## 10    TROPICAL STORM   8409286550

We will also plot it for easy visualization. (Note that the y axis again shows the natural logarithm of the totals.)

        qplot(EVTYPE1, 
              log(ecosum), 
              data= arrange(event_smry, desc(ecosum))[1:10, ], 
              xlab = "Top 10 Weather Events",
              ylab = "Log of total economic damage") + 
        labs(title = "Figure 2: Economic Consequences") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))
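When the x variable is a character column, the categories are drawn in alphabetical order rather than by rank. A minimal sketch, assuming one would prefer the events ordered by damage, is to convert the column with `reorder()` before plotting. The four rows below are copied from the table above; the data frame name `top` is illustrative.

```r
# Four of the top events with their totals, copied from the table above.
top <- data.frame(EVTYPE1 = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE/TIDE"),
                  ecosum  = c(161000844600, 90271472810, 58959393590, 47965579000),
                  stringsAsFactors = FALSE)

# Default ordering of a character column: alphabetical.
alphabetical <- levels(factor(top$EVTYPE1))

# Order the levels by descending damage instead.
top$EVTYPE1 <- reorder(top$EVTYPE1, -top$ecosum)
by_damage <- levels(top$EVTYPE1)

alphabetical
by_damage
```

Passing the reordered column to the plotting call would then place FLOOD first and STORM SURGE/TIDE last among these four, matching the order of the table.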

As noted above, floods cause the greatest economic damage, followed by hurricanes/typhoons and tornadoes.

Opportunities for further research

In the data, the event type is clearly not clean. It needs a lot of tidying up to match the 48 event types listed in the code book. This requires further expert knowledge of the weather data and is left for future analysis.

However, the event types in the top 10 tables and figures mostly match the official descriptions in the code book, so the analysis is assumed to be reasonable.