Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

This report consists of analysis of the NOAA Storm Data, aiming at answering the questions how the weather events are related to threaten to public health and damage to properties and crops.

Data Processing

The data under study is the NOAA Storm Data, where a detailed documentation of the dataset is given as Storm Data Documentation. We first download the dataset directly from the data website and read the data into storm:

## download, unzip and load data set
url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("StormData.csv.bz2")) {
      download.file(url, destfile = "StormData.csv.bz2")
}
storm <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

From the output, we see the dataset consists of 37 features with 902297 observations. Most of the features will not be relevant for our analysis. The main features that we are interested in include:

  1. EVTYPE: Severe weather type
  2. FATALITIES: Number of individuals killed in the event of hazzard
  3. INJURIES: Number of individuals injuried during the event
  4. PROPDMG + PROPDMGEXP: Property damages and the corresponding unit
  5. CROPDMG + CROPDMGEXP: Crop damages and the unit

The raw dataset is extremely untidy, such as existence of missing data, non-unified terminology of weather type. Thus a further data processing is needed. Here, we split our data analysis with respect to public health and economic consequence, by creating two separate clean dataset health and economy.

Public health

First, we subset the INJURIES and FATALITIES data from the storm dataset

library(dplyr)
health <- select(storm, EVTYPE, FATALITIES, INJURIES) %>% 
      mutate_each(funs(toupper), EVTYPE)
str(health)
## 'data.frame':    902297 obs. of  3 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...

Further, we group the EVTYPE by names and take summation of INJURIES and FATALITIES data within each group. Another column HARMS is added to the dataset, which gives the total number of individuals affected in the event. The dataset is displaced in descending order of HARMS.

## clean `health` grouping elements in `EVTYPE`
healthclean <- aggregate(. ~ EVTYPE, health, FUN = sum) %>% 
      filter(FATALITIES > 0 | INJURIES > 0) %>%
      mutate(HARMS = INJURIES + FATALITIES) %>%
      arrange(desc(HARMS))
healthclean[1:25,]
##                EVTYPE FATALITIES INJURIES HARMS
## 1             TORNADO       5633    91346 96979
## 2      EXCESSIVE HEAT       1903     6525  8428
## 3           TSTM WIND        504     6957  7461
## 4               FLOOD        470     6789  7259
## 5           LIGHTNING        816     5230  6046
## 6                HEAT        937     2100  3037
## 7         FLASH FLOOD        978     1777  2755
## 8           ICE STORM         89     1975  2064
## 9   THUNDERSTORM WIND        133     1488  1621
## 10       WINTER STORM        206     1321  1527
## 11          HIGH WIND        248     1137  1385
## 12               HAIL         15     1361  1376
## 13  HURRICANE/TYPHOON         64     1275  1339
## 14         HEAVY SNOW        127     1021  1148
## 15           WILDFIRE         75      911   986
## 16 THUNDERSTORM WINDS         64      908   972
## 17           BLIZZARD        101      805   906
## 18                FOG         62      734   796
## 19        RIP CURRENT        368      232   600
## 20   WILD/FOREST FIRE         12      545   557
## 21          HEAT WAVE        172      379   551
## 22       RIP CURRENTS        204      297   501
## 23         DUST STORM         22      440   462
## 24     WINTER WEATHER         33      398   431
## 25     TROPICAL STORM         58      340   398

Now, the dataset healthclean contains only 205 observations, dramatically reduced. The first few lines of the dataset indicate that EVTYPE name is still not tidy. For example,

  • THUNDERSTORM WIND and THUNDERSTORM WINDS differ by only an S
  • TSTM WIND and THUNDERSTORM WIND is an example disnormal abbreviation
  • HURRICANE/TYPHOON is simply a concatenate of HURRICANE and TYPHOON, both have similar meaning
  • COLD\WIND CHILL is a combination of two weather events COLD and WIND CHILL

Since the ranked public health data is decaying exponentially fast and we care only the most severe weathers. We can put it by hand the names of the top 20 EVTYPE as

ev_liv <- c("TORNADO"          , "EXCESSIVE HEAT"    , 
            "THUNDERSTORM WIND", "FLOOD"             , 
            "LIGHTNING"        , "HEAT"              , 
            "FLASH FLOOD"      , "ICE STORM"         ,   
            "WINTER STORM"     , "HIGH WIND"         , 
            "HAIL"             , "HURRICANE/TYPHOON" , 
            "HEAVY SNOW"       , "WILDFIRE"          , 
            "BLIZZARD"         , "FOG"               , 
            "RIP CURRENT"      , "DUST STORM"        , 
            "WINTER WEATHER"   , "TROPICAL STORM"    )

ev_liv_rep <- c("TORNADO"       , "EXCESSIVE HEAT"    ,
                "THUNDERSTORM WIND|TSTM WIND"         ,
                "^FLOOD|RIVER FLOOD|/FLOOD"           ,
                "LIGHTNING"                           ,
                "^HEAT"         , "FLASH FLOOD"       , 
                "ICE STORM"     , "WINTER STORM"      ,
                "HIGH WIND"     , "HAIL"              ,
                "HURRICANE/TYPHOON|HURRICANE|TYPHOON" ,
                "HEAVY SNOW"    ,
                "WILDFIRE|WILD/FOREST FIRE"           ,
                "BLIZZARD"      , "FOG"               ,
                "RIP CURRENT"   , "DUST STORM"        ,
                "WINTER WEATHER", "TROPICAL STORM"    )

Once with the standard names defined, one can further clean the EVTYPE data

for (i in seq_along(ev_liv)) {
      ind <- grep(ev_liv_rep[i], healthclean$EVTYPE)
      healthclean$EVTYPE[ind] <- ev_liv[i]
}
healthclean <- aggregate(. ~ EVTYPE, healthclean, FUN = sum) %>%
      arrange(desc(HARMS))
healthclean[1:20,]
##               EVTYPE FATALITIES INJURIES HARMS
## 1            TORNADO       5661    91407 97068
## 2  THUNDERSTORM WIND        728     9493 10221
## 3     EXCESSIVE HEAT       1922     6525  8447
## 4              FLOOD        518     6809  7327
## 5          LIGHTNING        817     5232  6049
## 6               HEAT       1118     2494  3612
## 7        FLASH FLOOD        999     1787  2786
## 8          ICE STORM         89     1990  2079
## 9          HIGH WIND        298     1508  1806
## 10      WINTER STORM        217     1353  1570
## 11          WILDFIRE         87     1456  1543
## 12 HURRICANE/TYPHOON        133     1333  1466
## 13              HAIL         15     1371  1386
## 14        HEAVY SNOW        127     1034  1161
## 15               FOG         81     1077  1158
## 16       RIP CURRENT        577      529  1106
## 17          BLIZZARD        101      805   906
## 18    WINTER WEATHER         61      538   599
## 19        DUST STORM         22      440   462
## 20    TROPICAL STORM         66      383   449

Economic Consequence

Similar to the case with public health data, first, we subset the data relevant to economic consequence.

economy <- select(storm, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
str(economy)
## 'data.frame':    902297 obs. of  5 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...

The features PROPDMGEXP and CROPDMGEXP are composed of character variables, which indicate thousand, million or billion etc (in unit of US dollars).

unique(economy$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(economy$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Here, anomalous entries, e.g. "+", "-", "?", "" is void of real meaning and thus should be omitted. Characters with upper and lower cases, e.g. "H", "h", give the same unit. It is more convinient to deal with only one case, thus all the characters will be transfered into upper case.

economy <-  filter(economy, 
                   !(PROPDMGEXP %in% c("+", "-", "?", "") | CROPDMGEXP %in% c("?", "")) & 
                   (PROPDMG > 0 | CROPDMG > 0)) %>%
      mutate_each(funs(toupper), EVTYPE, PROPDMGEXP, CROPDMGEXP)
unique(economy$PROPDMGEXP)
## [1] "B" "M" "K" "5" "0" "3"
unique(economy$CROPDMGEXP)
## [1] "M" "K" "0" "B"

Now the data in PROPDMGEXP and CROPDMGEXP becomes tidy. The next step is to combine PROPDMG + PROPDMGEXP (CROPDMG + CROPDMGEXP) to give damage in units of billion dollars.

economy$PROPDMGEXP[grep("K", economy$PROPDMGEXP)] <- 3 
economy$CROPDMGEXP[grep("K", economy$CROPDMGEXP)] <- 3 
economy$PROPDMGEXP[grep("M", economy$PROPDMGEXP)] <- 6 
economy$CROPDMGEXP[grep("M", economy$CROPDMGEXP)] <- 6
economy$PROPDMGEXP[grep("B", economy$PROPDMGEXP)] <- 9
economy$CROPDMGEXP[grep("B", economy$CROPDMGEXP)] <- 9

economy <- mutate(economy, 
                  PROPDMG = PROPDMG * 10 ^ (as.numeric(PROPDMGEXP) - 9),
                  CROPDMG = CROPDMG * 10 ^ (as.numeric(CROPDMGEXP) - 9)) %>%
      select(EVTYPE, PROPDMG, CROPDMG)

Given PROPDMG and CROPDMG is already tidy, the next step is to aggregate the dataset according to the weather types in EVTYPE.

economyclean <- aggregate(. ~ EVTYPE, economy, FUN = sum)

The economyclean dataset contains 120 observations. Let us print the top 25 observations when arranging the dataset according to property and crop damage in descending order separately in order to determine names of the EVTYPE

economyclean <- arrange(economyclean, desc(PROPDMG))
economyclean[1:25, c("EVTYPE", "PROPDMG")]
##                        EVTYPE     PROPDMG
## 1                       FLOOD 132.8364891
## 2           HURRICANE/TYPHOON  26.7402950
## 3                     TORNADO  16.1669467
## 4                   HURRICANE   9.7163580
## 5                        HAIL   7.9947887
## 6                 FLASH FLOOD   7.3284961
## 7                 RIVER FLOOD   5.0796350
## 8            STORM SURGE/TIDE   4.6406430
## 9                    WILDFIRE   3.4983655
## 10          THUNDERSTORM WIND   3.3989424
## 11                  HIGH WIND   2.4257423
## 12             HURRICANE OPAL   2.1680000
## 13 TORNADOES, TSTM WIND, HAIL   1.6000000
## 14             TROPICAL STORM   1.0565913
## 15               WINTER STORM   1.0178442
## 16                  ICE STORM   0.9030374
## 17                  TSTM WIND   0.6583057
## 18                  LIGHTNING   0.3152740
## 19                 HEAVY RAIN   0.3107019
## 20         THUNDERSTORM WINDS   0.2801900
## 21          FLASH FLOOD/FLOOD   0.2710000
## 22             HURRICANE ERIN   0.2560000
## 23                    DROUGHT   0.2337210
## 24                 HEAVY SNOW   0.1782920
## 25              COASTAL FLOOD   0.1675806
economyclean <- arrange(economyclean, desc(CROPDMG))
economyclean[1:25, c("EVTYPE", "CROPDMG")]
##                       EVTYPE   CROPDMG
## 1                      FLOOD 5.1709555
## 2                RIVER FLOOD 5.0287340
## 3                  ICE STORM 5.0221135
## 4                  HURRICANE 2.6889100
## 5          HURRICANE/TYPHOON 2.6078728
## 6                       HAIL 2.0538079
## 7                    DROUGHT 1.6526960
## 8                FLASH FLOOD 1.3880291
## 9               FROST/FREEZE 0.9319010
## 10                 HIGH WIND 0.6319243
## 11                 TSTM WIND 0.4972844
## 12            EXCESSIVE HEAT 0.4924000
## 13            TROPICAL STORM 0.4517110
## 14         THUNDERSTORM WIND 0.4147055
## 15                   TORNADO 0.4033796
## 16        THUNDERSTORM WINDS 0.1861133
## 17                  WILDFIRE 0.1861029
## 18                HEAVY SNOW 0.1316431
## 19                  BLIZZARD 0.1120600
## 20          WILD/FOREST FIRE 0.0987072
## 21         FLOOD/FLASH FLOOD 0.0950340
## 22               STRONG WIND 0.0649485
## 23                HEAVY RAIN 0.0645858
## 24               HEAVY RAINS 0.0605000
## 25 SEVERE THUNDERSTORM WINDS 0.0290000

Combine results shown in both prints above, the names chosen to study in EVTYPE is

ev_eco <- c("FLOOD"         , "HURRICANE/TYPHOON" ,
            "TORNADO"       , "HAIL"              ,
            "FLASH FLOOD"   , "STORM SURGE/TIDE"  ,
            "WILDFIRE"      , "THUNDERSTORM WIND" ,
            "HIGH WIND"     , "TROPICAL STORM"    ,
            "WINTER STORM"  , "ICE STORM"         ,
            "LIGHTNING"     , "HEAVY RAIN"        ,
            "DROUGHT"       , "HEAVY SNOW"        ,
            "COASTAL FLOOD" , "FROST/FREEZE"      ,
            "EXCESSIVE HEAT", "STRONG WIND"       )

ev_eco_rep <- c("^FLOOD|RIVER FLOOD|/FLOOD"          ,
                "HURRICANE/TYPHOON|HURRICANE|TYPHOON" ,
                "TORNADO"       , "HAIL"             ,
                "FLASH FLOOD"   , 
                "STORM SURGE/TIDE|STORM TIDE"        ,
                "WILDFIRE|WILD/FOREST FIRE"          ,
                "THUNDERSTORM WIND|TSTM WIND"        ,
                "HIGH WIND"     , "TROPICAL STORM"   ,
                "WINTER STORM"  , "ICE STORM"        ,
                "LIGHTNING"     , "HEAVY RAIN"       ,
                "DROUGHT"       , "HEAVY SNOW"       ,
                "COASTAL FLOOD" , 
                "FROST/FREEZE|FROST|FREEZE"          ,
                "EXCESSIVE HEAT", "STRONG WIND"      )

Given the names of EVTYPE defined for economy dataset, further clean EVTYPE data

for (i in seq_along(ev_eco)) {
      ind <- grep(ev_eco_rep[i], economyclean$EVTYPE)
      economyclean$EVTYPE[ind] <- ev_eco[i]
}
economyclean <- aggregate(. ~ EVTYPE, economyclean, FUN = sum) %>%
      arrange(desc(PROPDMG))
economyclean[1:20,]
##               EVTYPE     PROPDMG     CROPDMG
## 1              FLOOD 138.4239991 10.33268695
## 2  HURRICANE/TYPHOON  38.9968830  5.33311780
## 3            TORNADO  17.7669570  0.40588687
## 4               HAIL   7.9997102  2.07883995
## 5        FLASH FLOOD   7.3895281  1.39813510
## 6   STORM SURGE/TIDE   4.6406430  0.00085000
## 7  THUNDERSTORM WIND   4.3397600  1.12724733
## 8           WILDFIRE   3.5477275  0.28532210
## 9          HIGH WIND   2.6982874  0.66415485
## 10    TROPICAL STORM   1.0620913  0.46826100
## 11      WINTER STORM   1.0183442  0.02422400
## 12         ICE STORM   0.9030374  5.02211350
## 13        HEAVY RAIN   0.3352019  0.12658580
## 14         LIGHTNING   0.3152740  0.00551215
## 15           DROUGHT   0.2339210  1.65274600
## 16     COASTAL FLOOD   0.1928806  0.00005600
## 17        HEAVY SNOW   0.1782920  0.13164310
## 18         LANDSLIDE   0.1503035  0.02001700
## 19           TSUNAMI   0.1440620  0.00002000
## 20       STRONG WIND   0.1196554  0.06494850

Results

Public health

In the following, The results of ranking for both injuries and fatalities will be shown in separate tables

arrange(healthclean, desc(INJURIES))[1:10, c("EVTYPE", "INJURIES")]
##               EVTYPE INJURIES
## 1            TORNADO    91407
## 2  THUNDERSTORM WIND     9493
## 3              FLOOD     6809
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5232
## 6               HEAT     2494
## 7          ICE STORM     1990
## 8        FLASH FLOOD     1787
## 9          HIGH WIND     1508
## 10          WILDFIRE     1456
arrange(healthclean, desc(FATALITIES))[1:10, c("EVTYPE", "FATALITIES")]
##               EVTYPE FATALITIES
## 1            TORNADO       5661
## 2     EXCESSIVE HEAT       1922
## 3               HEAT       1118
## 4        FLASH FLOOD        999
## 5          LIGHTNING        817
## 6  THUNDERSTORM WIND        728
## 7        RIP CURRENT        577
## 8              FLOOD        518
## 9          HIGH WIND        298
## 10         AVALANCHE        224

The following shows the harm to public health from the severe weather events, combining the injury and fatality data (top 10).

library(ggplot2)
library(reshape)
library(scales)
# reorder `healthclean` leading to ordered dataset after melting
healthclean <- transform(healthclean, EVTYPE = reorder(EVTYPE, desc(HARMS)))

# subset and melt to generate suitable data for barplot
counts <- healthclean[1:10, c("EVTYPE", "FATALITIES", "INJURIES")]
counts <- melt(counts, id.vars = "EVTYPE")

# ggplot `identity` required, x ticks set to vertical
g <- ggplot(counts, aes(EVTYPE, value, fill = variable)) 
g + geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      labs(x = "", y = "Individuals Affected",
           title = "Weather Events Affecting Public Health")

There are several interesting conclusions obtained from the analysis:

  1. Undoubtedly, the most severe weather event to human health is the TORNADO, which leads to almost one million injuries and six thousand death over the years 1950 - 2011.
  2. The weather leads to second highest death rate is HEAT or EXCESSIVE HEAT, instead of other events, such as FLOOD or THUNDERSTORM.

Economic consequence

The following two graphs indicate property and crop damages caused by severe weather events respectively.

par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
with(arrange(economyclean, desc(PROPDMG))[1:10,],
     barplot(PROPDMG, las = 3, names.arg = EVTYPE, 
             main = "Weather Events: Property Damages", 
             ylab = "Cost of damages (billions usd)", col = "blue"))
with(arrange(economyclean, desc(CROPDMG))[1:10,],
     barplot(CROPDMG, las = 3, names.arg = EVTYPE, 
             main = "Weather Events: Crop Damages", 
             ylab = "Cost of damages (billions usd)", col = "red"))

As a conclusion to this part of the analysis:

  1. different to the case to public health, the hazzardous event with respect to economic properties is FLOOD, followed by HURRICANE/TYPHOON.
  2. TORNADO, which leads to significant damage to properties, does not seem to affect crop that much.
  3. weather events like ICE STORM or DROUGHT, though leads to crop damage, does not damage the properties.
  4. overall, the property damages caused by severe weathers are more significant, compared with the damage to crops.