Synopsis

In this report we study the severe weather events across the United States between 1996 and 2011, using the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. More specifically, we want to find out which types of events are the most harmful with respect to population health, and which have the greatest economic consequences. The data between the years 1996 and 2011 are used, excluding the earlier years data, because recent years data are more complete. Our analysis found that the most harmful event types are Tornado, Excessive Heat, and Flood. On the other hand, event types having the greatest economic consequences are Flood, Hurricane, and Storm Surge/Tide.

Data Processing

Reading Data

First we manually download the data from the link provided in the course website, save it in the working directory and unzip it. Then we read the raw csv.bz2 file where the missing data are indicated by empty strings.

df <- read.csv("repdata-data-StormData.csv.bz2", na.strings = "")

Using Data Only From Recent Years

First we plot an exploratory histogram about the years of the events, from 1950 to 2011.

library(lubridate)
df_year <- year(mdy_hms(df$BGN_DATE))
hist(df_year, xlab="Year", main="Histogram of Event Years")

We see that in the earlier years there are much fewer events recorded, which is probably due to a lack of good records. On the other hand, recent years should be considered more complete. Therefore for a fair comparison among the different types of events, and also for speeding up our data cleaning process, we make a decision of using only the recent years data. The cutoff year we choose is 1995.

library(dplyr)
df <- df %>% filter(year(mdy_hms(BGN_DATE)) > 1995)

Notice that since we are only interested in finding out which types of events are the most harmful and have the greatest economic consequences, discarding the fewer, earlier years data would have little effect on our conclusion.

Filtering

Once again that we are only interested in finding out which types of events are the most harmful and have the greatest economic consequences, hence we only keep data with fatalities/injuries, or with property/crop damages of at least one million dollars. Since there are data with property/crop damages of billions of dollars, damages of less than one million dollars (less than 0.1 percent of the highest damage) are not significant to our anaylsis.

## Creating a new function to convert the exponential symbols to numbers.
exp_convert <- function(x) {
    if (x[1] == "B") {
        1000000000
    } else if (x[1] == "M") {
        1000000
    } else {
        0
    }
}

## Setting NA to zero
df[is.na(df$PROPDMGEXP),]$PROPDMGEXP <- 0
df[is.na(df$CROPDMGEXP),]$CROPDMGEXP <- 0

## Computing the actual damage by multiplying the DMG columns 
## with the corresponding converted exponential numbers
df <- df %>% group_by(PROPDMGEXP) %>% 
    mutate(PROPDMGEXPCON = exp_convert(PROPDMGEXP)) %>%
    mutate(PROPDMGMULTIPLIED = PROPDMG * PROPDMGEXPCON)
df <- df %>% group_by(CROPDMGEXP) %>% 
    mutate(CROPDMGEXPCON = exp_convert(CROPDMGEXP)) %>%
    mutate(CROPDMGMULTIPLIED = CROPDMG * CROPDMGEXPCON)

## Filtering
df <- df %>% filter(FATALITIES + INJURIES > 0 | 
                        PROPDMGMULTIPLIED >= 1000000 |
                        CROPDMGMULTIPLIED >= 1000000)

Data Cleaning

The next step of our data processing is cleaning the \(EVTYPE\) variable (which indicates the type of events). Our goal in this section is to match the data in \(ENTYPE\) with the event names listed in Table 1 of Section 2.1.1 of the National Weather Service Storm Data Documentation.

The following is a list of raw data entries in \(EVTYPE\):

sort(unique(df$EVTYPE))
##   [1]  TSTM WIND                AGRICULTURAL FREEZE      
##   [3] ASTRONOMICAL HIGH TIDE    AVALANCHE                
##   [5] BLACK ICE                 BLIZZARD                 
##   [7] blowing snow              BRUSH FIRE               
##   [9] COASTAL  FLOODING/EROSION Coastal Flood            
##  [11] COASTAL FLOOD             Coastal Flooding         
##  [13] COASTAL FLOODING          COASTAL FLOODING/EROSION 
##  [15] Coastal Storm             COASTAL STORM            
##  [17] COASTALSTORM              Cold                     
##  [19] COLD                      COLD AND SNOW            
##  [21] Cold Temperature          COLD WEATHER             
##  [23] COLD/WIND CHILL           DAM BREAK                
##  [25] Damaging Freeze           DAMAGING FREEZE          
##  [27] DENSE FOG                 DROUGHT                  
##  [29] DROWNING                  DRY MICROBURST           
##  [31] Dust Devil                DUST DEVIL               
##  [33] DUST STORM                Early Frost              
##  [35] Erosion/Cstl Flood        EXCESSIVE HEAT           
##  [37] EXCESSIVE SNOW            Extended Cold            
##  [39] Extreme Cold              EXTREME COLD             
##  [41] EXTREME COLD/WIND CHILL   EXTREME WINDCHILL        
##  [43] FALLING SNOW/ICE          FLASH FLOOD              
##  [45] FLOOD                     FOG                      
##  [47] Freeze                    FREEZE                   
##  [49] FREEZING DRIZZLE          FREEZING FOG             
##  [51] FREEZING RAIN             Freezing Spray           
##  [53] FROST                     Frost/Freeze             
##  [55] FROST/FREEZE              FUNNEL CLOUD             
##  [57] GLAZE                     GUSTY WIND               
##  [59] Gusty winds               Gusty Winds              
##  [61] GUSTY WINDS               HAIL                     
##  [63] HARD FREEZE               HAZARDOUS SURF           
##  [65] HEAT                      Heat Wave                
##  [67] HEAVY RAIN                Heavy Rain/High Surf     
##  [69] HEAVY SEAS                HEAVY SNOW               
##  [71] Heavy snow shower         Heavy Surf               
##  [73] HEAVY SURF                Heavy surf and wind      
##  [75] HEAVY SURF/HIGH SURF      HIGH SEAS                
##  [77] High Surf                 HIGH SURF                
##  [79] HIGH SWELLS               HIGH WATER               
##  [81] HIGH WIND                 HURRICANE                
##  [83] Hurricane Edouard         HURRICANE/TYPHOON        
##  [85] HYPERTHERMIA/EXPOSURE     Hypothermia/Exposure     
##  [87] HYPOTHERMIA/EXPOSURE      ICE ON ROAD              
##  [89] ICE ROADS                 ICE STORM                
##  [91] ICY ROADS                 LAKE-EFFECT SNOW         
##  [93] LAKESHORE FLOOD           LANDSLIDE                
##  [95] LANDSLIDES                LIGHT SNOW               
##  [97] LIGHTNING                 Marine Accident          
##  [99] MARINE HIGH WIND          MARINE STRONG WIND       
## [101] MARINE THUNDERSTORM WIND  MARINE TSTM WIND         
## [103] MIXED PRECIP              Mudslide                 
## [105] MUDSLIDE                  Mudslides                
## [107] NON TSTM WIND             NON-SEVERE WIND DAMAGE   
## [109] OTHER                     RAIN/SNOW                
## [111] RECORD HEAT               RIP CURRENT              
## [113] RIP CURRENTS              RIVER FLOOD              
## [115] River Flooding            RIVER FLOODING           
## [117] ROGUE WAVE                ROUGH SEAS               
## [119] ROUGH SURF                SMALL HAIL               
## [121] Snow                      SNOW                     
## [123] SNOW AND ICE              SNOW SQUALL              
## [125] Snow Squalls              STORM SURGE              
## [127] STORM SURGE/TIDE          STRONG WIND              
## [129] Strong Winds              STRONG WINDS             
## [131] THUNDERSTORM              THUNDERSTORM WIND        
## [133] THUNDERSTORM WIND (G40)   TIDAL FLOODING           
## [135] TORNADO                   Torrential Rainfall      
## [137] TROPICAL DEPRESSION       TROPICAL STORM           
## [139] TSTM WIND                 TSTM WIND (G35)          
## [141] TSTM WIND (G40)           TSTM WIND (G45)          
## [143] TSTM WIND/HAIL            TSUNAMI                  
## [145] TYPHOON                   Unseasonable Cold        
## [147] UNSEASONABLY COLD         UNSEASONABLY WARM        
## [149] UNSEASONAL RAIN           URBAN/SML STREAM FLD     
## [151] WARM WEATHER              WATERSPOUT               
## [153] Whirlwind                 WILD/FOREST FIRE         
## [155] WILDFIRE                  WIND                     
## [157] WIND AND WAVE             WINDS                    
## [159] WINTER STORM              WINTER WEATHER           
## [161] WINTER WEATHER MIX        WINTER WEATHER/MIX       
## [163] WINTRY MIX               
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

Filtering

The data corresponding to “Other” and “Record Heat” are not needed, so we take them out.

df <- df %>% filter(!grepl("other|record", EVTYPE, ignore.case=TRUE))

Next we will try to go along the list and relabel the entries.

1. Translating into Upper Case

The first relabelling action we do is to translate all the characters into upper case.

df <- df %>% mutate(EVTYPE = toupper(EVTYPE))

3. Thunderstorm Wind and Marine Thunderstorm Wind

We recongnize that “TSTM” is an abbreviation of Thunderstorm, and we use regular expression to group all the thunderstorm data together (but not Marine Thunderstorm Wind). Same process is done for Marine Thunderstorm Wind.

df$EVTYPE[grepl("^ *TSTM|^THUNDERSTORM", df$EVTYPE)] <- "THUNDERSTORM WIND"
df$EVTYPE[grepl("MARINE TSTM WIND", df$EVTYPE)] <- "MARINE THUNDERSTORM WIND"

4. Storm Surge/Tide

We realize that high tide and storm surge falls under the category of Storm Surge/Tide.

df$EVTYPE[grepl("HIGH TIDE|STORM SURGE", df$EVTYPE)] <- "STORM SURGE/TIDE"

5. Ice

Black Ice and other ice related entries (but not ice storem) fall under the category of Winter Weather.

df$EVTYPE[grepl(".+ICE|IC.*ROAD", df$EVTYPE)] <- "WINTER WEATHER"

6. Snow and Rain

Here we deal with the data containing the word “snow”. Rain/Snow is relabelled as Sleet, Light Snow goes under Winter Weather, Cold and Snow under Cold/Wind Chill, and finally the rest under Heavy Snow except Lake-Effect Snow.

df$EVTYPE[grepl("RAIN/SNOW", df$EVTYPE)] <- "SLEET"
df$EVTYPE[grepl("LIGHT SNOW", df$EVTYPE)] <- "WINTER WEATHER"
df$EVTYPE[grepl("COLD AND SNOW", df$EVTYPE)] <- "COLD/WIND CHILL"
df$EVTYPE[grepl("SNOW", df$EVTYPE) &
              !grepl("LAKE-EFFECT SNOW", df$EVTYPE)] <- "HEAVY SNOW"

After relabelling Rain/Snow, we can put the rest of data containing “rain” under the event type Heavy Rain.

df$EVTYPE[grepl("RAIN", df$EVTYPE)] <- "HEAVY RAIN"

7. Coastal Flood

We recongnize that “CSTL” is an abbreviation of Coastal, and we also put Tital Flooding under Coastal Flood.

df$EVTYPE[grepl("COASTAL +FLOOD|CSTL|TIDAL", df$EVTYPE)] <- "COASTAL FLOOD"

8. Coastal Storm

We put the coastal storm data under Tropical Storm.

df$EVTYPE[grepl("COASTAL ?STORM", df$EVTYPE)] <- "TROPICAL STORM"

9. Cold/Wind Chill and Extreme Cold/Wind Chill

All data containing the word “extreme” are put under the category of Extreme Cold/Wind Chill, while the rest of data containing “cold” are put under Cold/Wind Chill.

df$EVTYPE[grepl("EXTREME", df$EVTYPE)] <- "EXTREME COLD/WIND CHILL"
df$EVTYPE[grepl("COLD", df$EVTYPE) &
              !grepl("EXTREME COLD", df$EVTYPE)] <- "COLD/WIND CHILL"

10. Flash Flood

Dam Break is relabelled as Flash Flood.

df$EVTYPE[grepl("DAM BREAK", df$EVTYPE)] <- "FLASH FLOOD"

11. Fog

Fog is relabelled as Dense Fog.

df$EVTYPE[grepl("^FOG", df$EVTYPE)] <- "DENSE FOG"

12. Frost/Freeze

All data containing “freezing” are put under Frost/Freeze, except Freezing Fog. Glaze is put under Frost/Freeze as well.

df$EVTYPE[grepl("FREEZING|GLAZE", df$EVTYPE) &
              !grepl("FREEZING FOG", df$EVTYPE)] <- "FROST/FREEZE"

13. Strong Wind

Gusty wind(s), Wind(s), Non TSTM Wind, Non-Severe Wind Damage, and Strong Winds are relabelled as Strong Wind.

df$EVTYPE[grepl("^WIND.?$|GUSTY|NON.*WIND|^STRONG WIND", df$EVTYPE)] <- 
    "STRONG WIND"

14. High Surf

Next we put all data containing “surf”, “swells”, “wave”, together with those ending with “water” and “seas” under High Surf.

df$EVTYPE[grepl("SURF|SWELLS|WAVE|SEAS$|WATER$", df$EVTYPE)] <- "HIGH SURF"

15. Heat

We relabel Heat Wave as Heat, and we also put data containing “warm” under Heat.

df$EVTYPE[grepl("HEAT WAVE|WARM", df$EVTYPE)] <- "HEAT"

16. Hyperthermia and Hypothermia

They are put under Heat and Cold/Wind Chill respectively.

df$EVTYPE[grepl("HYPERTHERMIA", df$EVTYPE)] <- "HEAT"
df$EVTYPE[grepl("HYPOTHERMIA", df$EVTYPE)] <- "COLD/WIND CHILL"

17. Debris Flow

Landslide(s) and Mudslide(s) are relabelled as Debris Flow.

df$EVTYPE[grepl("LANDSLIDE|MUDSLIDE", df$EVTYPE)] <- "DEBRIS FLOW"

18. Sleet

We recognize that “Mixed Precip” stands for Mixed Precipitation, hence it should be put under Sleet.

df$EVTYPE[grepl("MIXED PRECIP", df$EVTYPE)] <- "SLEET"

19. Flood

River Flood(ing) belongs to the category of Flood, as well as Urban/Small Stream Flooding which is abbreviated as “Urban/Sml Stream Fld”.

df$EVTYPE[grepl("RIVER FLOOD|FLD", df$EVTYPE)] <- "FLOOD"

20. Hail

Small Hail is now put under the event type Hail.

df$EVTYPE[grepl("SMALL HAIL", df$EVTYPE)] <- "HAIL"

21. Tornado

Whirlwind is relabelled as Tornado.

df$EVTYPE[grepl("WHIRLWIND", df$EVTYPE)] <- "TORNADO"

21. Filtering

Finally we filter out the data that do not fit into any category listed in National Weather Service Storm Data Documentation. Those are Drowning, Dry MicroBurst, and Marine Accident

df <- df %>% filter(!grepl("DROWNING|MICROBURST|MARINE ACCIDENT", EVTYPE))

Results

1. Most Harmful Event Types with respect to Population Health

In order to find out which types of events are the most harmful with respect to population health, we add up the fatalities and injuries number for each event type. Then we add up the 2 numbers to form a column named Total. The events that have the highest total number are shown below.

df1 <- df %>% group_by(EVTYPE) %>% 
    summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), 
              TOTAL = sum(FATALITIES) + sum(INJURIES)) %>%
    arrange(desc(TOTAL))

head(df1, n=10)
## Source: local data frame [10 x 4]
## 
##               EVTYPE FATALITIES INJURIES TOTAL
## 1            TORNADO       1512    20667 22179
## 2     EXCESSIVE HEAT       1797     6391  8188
## 3              FLOOD        444     6838  7282
## 4  THUNDERSTORM WIND        379     5129  5508
## 5          LIGHTNING        651     4141  4792
## 6        FLASH FLOOD        887     1674  2561
## 7           WILDFIRE         87     1458  1545
## 8       WINTER STORM        191     1292  1483
## 9               HEAT        238     1241  1479
## 10         HURRICANE        125     1328  1453

We see that Tornado is the most harmful event by a large margin. It is followed by Excessive Heat, Flood, Thunderstorm Wind and Lightning. After that the number drops significantly. We remark that even though the difference in the injuries numbers between Tornado and Excessive Heat is huge, Excessive Heat has the highest number of fatalities so we can consider it to be more lethal than Tornado. We conlude this subsection by plotting the above table in a bar chart.

library(reshape2)
df1_long <- melt(df1[1:10,1:3], id.vars="EVTYPE")

library(ggplot2)
ggplot(df1_long, aes(x = EVTYPE, y = value, fill = variable)) + 
    geom_bar(stat="identity") +
    theme(axis.text.x=element_text(angle=90)) +
    labs(title="Most Harmful Event Types") +
    labs(x="Event Type", y="TOTAL")

2. Event Types Having the Greatest Economic Consequences

Similar to the previous part, we add up the property damage and the crop damage for each event type, and then we sum the 2 numbers to form a column named Total Damage. The events with the highest total number are shown below, where the numbers are in billion dollars.

df2 <- df %>% group_by(EVTYPE) %>% 
    summarize(PROP_DAMAGE = sum(PROPDMGMULTIPLIED)/10^9, 
              CROP_DAMAGE = sum(CROPDMGMULTIPLIED)/10^9, 
              TOTAL_DAMAGE= sum(PROPDMGMULTIPLIED)/10^9 + 
                  sum(CROPDMGMULTIPLIED)/10^9) %>%
    arrange(desc(TOTAL_DAMAGE))

head(df2, n=10)
## Source: local data frame [10 x 4]
## 
##               EVTYPE PROP_DAMAGE CROP_DAMAGE TOTAL_DAMAGE
## 1              FLOOD   143.27821     4.86032    148.13853
## 2          HURRICANE    81.70724     5.34348     87.05072
## 3   STORM SURGE/TIDE    47.81798     0.00000     47.81798
## 4            TORNADO    23.43980     0.19349     23.63329
## 5               HAIL    14.02981     1.99875     16.02856
## 6        FLASH FLOOD    13.98040     1.17501     15.15541
## 7            DROUGHT     1.04305    13.34662     14.38967
## 8     TROPICAL STORM     7.59392     0.67312      8.26704
## 9           WILDFIRE     7.64324     0.39410      8.03734
## 10 THUNDERSTORM WIND     5.71324     0.83765      6.55089

The event type that causes the greatest economic damage is Flood, followed by Hurricane and Storm Surge/Tide. Notice that the differences among the total damages of these 3 event types are large. We also remark that the event type with the highest crop damage is Drought, which is also the only one in the above table with its crop damage higher than its property damage. Finally we conclude this subsection with a bar chart of the above result.

df2_long <- melt(df2[1:10,1:3], id.vars="EVTYPE")

library(ggplot2)
ggplot(df2_long, aes(x = EVTYPE, y = value, fill = variable)) + 
    geom_bar(stat="identity") +
    theme(axis.text.x=element_text(angle=90)) +
    labs(title="Event Types with Greatest Economic Consequences") +
    labs(x="Event Type", y="TOTAL DAMAGE (billion dollars)")