Synopsis

This document analyzes data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to study the impact of major storms on population health and the economy.
This is the second assignment of the Reproducible Research course by Prof. R. Peng on Coursera.
We perform an analysis of the consequences of severe weather events on population health and the economy.
We analyze the impact of severe weather events considering both the total damage and the average magnitude of events of the same type.
We find different clusters of events, distinguishing rare vs. non-rare events and weak vs. powerful events.

Data Processing

Data for the analysis are provided in CSV format by the course website at the following link.
The data were collected from 1950 to November 2011.

Before starting the analysis we set the global options required by the assignment.
The following packages are required:
- knitr
- dplyr
- ggplot2
- bitops
- RCurl
- R.utils
- lubridate
- xtable

  library ( bitops )
  library ( RCurl )
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:R.utils':
## 
##     reset
## 
## The following object is masked from 'package:R.oo':
## 
##     clone
  library ( knitr )
  library ( dplyr )
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  library ( ggplot2 )
  library ( R.utils )
  library ( lubridate )
  library ( xtable )
  opts_chunk$set(echo = TRUE , fig.height = 4)
  
  # set locale to get names in English
  lct <- Sys.getlocale("LC_TIME")
  Sys.setlocale("LC_TIME", "C")

Data are contained in the file repdata-data-StormData.csv.bz2 in the working directory.
To read the data we decompress the file and then read the resulting csv file.
Given the size of the file this operation can take a few minutes.

    bunzip2("repdata-data-StormData.csv.bz2", destname=gsub("[.]bz2$", "", "repdata-data-StormData.csv.bz2"), overwrite=TRUE, remove=FALSE)
        
    storm <- read.csv ( "repdata-data-StormData.csv" , header = TRUE )
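
As an aside, base R can usually read a bzip2-compressed CSV without a separate decompression step, because `read.csv` opens compressed files transparently. The following is a minimal sketch on a small temporary file; the file and its two rows are made up for illustration and are not the storm dataset.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HAIL,0"), con)
close(con)

# read.csv() detects the compression and reads the file directly.
toy <- read.csv(tmp, header = TRUE)
print(dim(toy))

# Remove the temporary file.
unlink(tmp)
```

Whether this is faster than decompressing first depends on the machine; the explicit `bunzip2` step above has the advantage of leaving a plain csv file on disk.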

After loading the data we can examine the content of the dataset.

    dim ( storm )
## [1] 902297     37
    str ( storm )
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
    head ( storm )
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

NAs are not a problem because relevant columns are complete.

    summary ( storm )
##     STATE__                  BGN_DATE             BGN_TIME     
##  Min.   : 1.0   5/25/2011 0:00:00:  1202   12:00:00 AM: 10163  
##  1st Qu.:19.0   4/27/2011 0:00:00:  1193   06:00:00 PM:  7350  
##  Median :30.0   6/9/2011 0:00:00 :  1030   04:00:00 PM:  7261  
##  Mean   :31.2   5/30/2004 0:00:00:  1016   05:00:00 PM:  6891  
##  3rd Qu.:45.0   4/4/2011 0:00:00 :  1009   12:00:00 PM:  6703  
##  Max.   :95.0   4/2/2006 0:00:00 :   981   03:00:00 PM:  6700  
##                 (Other)          :895866   (Other)    :857229  
##    TIME_ZONE          COUNTY         COUNTYNAME         STATE       
##  CST    :547493   Min.   :  0   JEFFERSON :  7840   TX     : 83728  
##  EST    :245558   1st Qu.: 31   WASHINGTON:  7603   KS     : 53440  
##  MST    : 68390   Median : 75   JACKSON   :  6660   OK     : 46802  
##  PST    : 28302   Mean   :101   FRANKLIN  :  6256   MO     : 35648  
##  AST    :  6360   3rd Qu.:131   LINCOLN   :  5937   IA     : 31069  
##  HST    :  2563   Max.   :873   MADISON   :  5632   NE     : 30271  
##  (Other):  3631                 (Other)   :862369   (Other):621339  
##                EVTYPE         BGN_RANGE       BGN_AZI      
##  HAIL             :288661   Min.   :   0          :547332  
##  TSTM WIND        :219940   1st Qu.:   0   N      : 86752  
##  THUNDERSTORM WIND: 82563   Median :   0   W      : 38446  
##  TORNADO          : 60652   Mean   :   1   S      : 37558  
##  FLASH FLOOD      : 54277   3rd Qu.:   1   E      : 33178  
##  FLOOD            : 25326   Max.   :3749   NW     : 24041  
##  (Other)          :170878                  (Other):134990  
##          BGN_LOCATI                  END_DATE             END_TIME     
##               :287743                    :243411              :238978  
##  COUNTYWIDE   : 19680   4/27/2011 0:00:00:  1214   06:00:00 PM:  9802  
##  Countywide   :   993   5/25/2011 0:00:00:  1196   05:00:00 PM:  8314  
##  SPRINGFIELD  :   843   6/9/2011 0:00:00 :  1021   04:00:00 PM:  8104  
##  SOUTH PORTION:   810   4/4/2011 0:00:00 :  1007   12:00:00 PM:  7483  
##  NORTH PORTION:   784   5/30/2004 0:00:00:   998   11:59:00 PM:  7184  
##  (Other)      :591444   (Other)          :653450   (Other)    :622432  
##    COUNTY_END COUNTYENDN       END_RANGE      END_AZI      
##  Min.   :0    Mode:logical   Min.   :  0          :724837  
##  1st Qu.:0    NA's:902297    1st Qu.:  0   N      : 28082  
##  Median :0                   Median :  0   S      : 22510  
##  Mean   :0                   Mean   :  1   W      : 20119  
##  3rd Qu.:0                   3rd Qu.:  0   E      : 20047  
##  Max.   :0                   Max.   :925   NE     : 14606  
##                                            (Other): 72096  
##            END_LOCATI         LENGTH           WIDTH            F         
##                 :499225   Min.   :   0.0   Min.   :   0   Min.   :0       
##  COUNTYWIDE     : 19731   1st Qu.:   0.0   1st Qu.:   0   1st Qu.:0       
##  SOUTH PORTION  :   833   Median :   0.0   Median :   0   Median :1       
##  NORTH PORTION  :   780   Mean   :   0.2   Mean   :   8   Mean   :1       
##  CENTRAL PORTION:   617   3rd Qu.:   0.0   3rd Qu.:   0   3rd Qu.:1       
##  SPRINGFIELD    :   575   Max.   :2315.0   Max.   :4400   Max.   :5       
##  (Other)        :380536                                   NA's   :843563  
##       MAG          FATALITIES     INJURIES         PROPDMG    
##  Min.   :    0   Min.   :  0   Min.   :   0.0   Min.   :   0  
##  1st Qu.:    0   1st Qu.:  0   1st Qu.:   0.0   1st Qu.:   0  
##  Median :   50   Median :  0   Median :   0.0   Median :   0  
##  Mean   :   47   Mean   :  0   Mean   :   0.2   Mean   :  12  
##  3rd Qu.:   75   3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:   0  
##  Max.   :22000   Max.   :583   Max.   :1700.0   Max.   :5000  
##                                                               
##    PROPDMGEXP        CROPDMG        CROPDMGEXP          WFO        
##         :465934   Min.   :  0.0          :618413          :142069  
##  K      :424665   1st Qu.:  0.0   K      :281832   OUN    : 17393  
##  M      : 11330   Median :  0.0   M      :  1994   JAN    : 13889  
##  0      :   216   Mean   :  1.5   k      :    21   LWX    : 13174  
##  B      :    40   3rd Qu.:  0.0   0      :    19   PHI    : 12551  
##  5      :    28   Max.   :990.0   B      :     9   TSA    : 12483  
##  (Other):    84                   (Other):     9   (Other):690738  
##                                STATEOFFIC    
##                                     :248769  
##  TEXAS, North                       : 12193  
##  ARKANSAS, Central and North Central: 11738  
##  IOWA, Central                      : 11345  
##  KANSAS, Southwest                  : 11212  
##  GEORGIA, North and Central         : 11120  
##  (Other)                            :595920  
##                                                                                                                                                                                                     ZONENAMES     
##                                                                                                                                                                                                          :594029  
##                                                                                                                                                                                                          :205988  
##  GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M                                                                                                                                         :   639  
##  GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA                                                                                                                                                       :   592  
##  JEFFERSON - JEFFERSON                                                                                                                                                                                   :   303  
##  MADISON - MADISON                                                                                                                                                                                       :   302  
##  (Other)                                                                                                                                                                                                 :100444  
##     LATITUDE      LONGITUDE        LATITUDE_E     LONGITUDE_    
##  Min.   :   0   Min.   :-14451   Min.   :   0   Min.   :-14455  
##  1st Qu.:2802   1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0  
##  Median :3540   Median :  8707   Median :   0   Median :     0  
##  Mean   :2875   Mean   :  6940   Mean   :1452   Mean   :  3509  
##  3rd Qu.:4019   3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735  
##  Max.   :9706   Max.   : 17124   Max.   :9706   Max.   :106220  
##  NA's   :47                      NA's   :40                     
##                                            REMARKS           REFNUM      
##                                                :287433   Min.   :     1  
##                                                : 24013   1st Qu.:225575  
##  Trees down.\n                                 :  1110   Median :451149  
##  Several trees were blown down.\n              :   569   Mean   :451149  
##  Trees were downed.\n                          :   446   3rd Qu.:676723  
##  Large trees and power lines were blown down.\n:   432   Max.   :902297  
##  (Other)                                       :588294

We parse the dates so that date functions can be applied to them later.

    begin_date <- as.POSIXlt ( storm$BGN_DATE , format="%m/%d/%Y" )
    end_date   <- as.POSIXlt ( storm$END_DATE , format="%m/%d/%Y" ) 

    # create year and decade variable
    begin_year <- year ( begin_date  )
    decade <- ((begin_year %% 100) %/% 10) * 10
    decade[decade < 50] <- decade[decade < 50] + 100  

    # add new variables
    storm <- cbind ( storm , begin_date , end_date , begin_year , decade )

    # keep workspace clean
    rm ( begin_date , end_date , begin_year , decade )
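
The decade encoding is easier to follow on a few concrete dates. The sketch below applies the same arithmetic to four illustrative dates written in the format of BGN_DATE; it uses `as.Date` and `format` from base R as a stand-in for `lubridate::year`, which is an assumption made only to keep the example self-contained.

```r
# Illustrative dates in the same text format as BGN_DATE.
raw_dates <- c("4/18/1950 0:00:00", "6/8/1999 0:00:00",
               "1/1/2000 0:00:00", "5/25/2011 0:00:00")

# as.Date() parses the leading month/day/year and ignores the trailing time.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Base-R stand-in for lubridate::year().
yr <- as.integer(format(parsed, "%Y"))

# Same arithmetic as in the report: two-digit decade,
# with the 2000s and 2010s shifted by 100.
decade <- ((yr %% 100) %/% 10) * 10
decade[decade < 50] <- decade[decade < 50] + 100

print(data.frame(year = yr, decade = decade))
```

The shift by 100 makes the decades sort chronologically: 50, 60, ..., 90 for the twentieth century, then 100 and 110 for the 2000s and 2010s.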

Having observed an unequal distribution of events in the first decades we decide to focus on observations starting from 1994.

    storm <-storm[ storm$begin_year>=1994 ,]

The columns reporting damage amounts (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) should be read together. The PROPDMG and CROPDMG columns contain the damage amount rounded to 3 significant digits, and the corresponding EXP columns contain the magnitude. According to the codebook, the valid values for EXP are H, K, M and B. Since only a few entries use other codes, we discard them by setting their multiplier to 0. We also create a binary variable indicating whether economic damage was recorded.

    # new property damage exp column
    newPropExp <- rep ( 0 , length= nrow(storm))
    newPropExp[storm$PROPDMGEXP %in% c("H","h")] <- 1e2
    newPropExp[storm$PROPDMGEXP %in% c("K","k")] <- 1e3
    newPropExp[storm$PROPDMGEXP %in% c("M","m")] <- 1e6
    newPropExp[storm$PROPDMGEXP %in% c("B","b")] <- 1e9

    # new crop damage exp column
    newCropExp <-  rep ( 0 , length= nrow(storm))
    newCropExp[storm$CROPDMGEXP %in% c("H","h")] <- 1e2
    newCropExp[storm$CROPDMGEXP %in% c("K","k")] <- 1e3
    newCropExp[storm$CROPDMGEXP %in% c("M","m")] <- 1e6
    newCropExp[storm$CROPDMGEXP %in% c("B","b")] <- 1e9

    # flag for economic damage presence
    prop_damage <- storm$PROPDMG * newPropExp   
    crop_damage <- storm$CROPDMG * newCropExp
    economic_damage <- rep ( 0 , length= nrow(storm)) 
    economic_damage [ newPropExp>0 | newCropExp >0 ] <- 1
    storm <- cbind ( storm  , prop_damage , crop_damage , economic_damage )

    # keep workspace clean
    rm ( newPropExp , newCropExp , prop_damage , crop_damage , economic_damage )
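
The conversion from an amount and an exponent code to a dollar value can be checked on a handful of values. This is a minimal sketch using a named lookup vector instead of the repeated assignments above; the vector `multipliers` and the toy inputs are hypothetical and not taken from the dataset.

```r
# Multipliers for the documented exponent codes (H, K, M, B).
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Toy values, not taken from the dataset.
dmg     <- c(25, 2.5, 10, 3)
dmg_exp <- c("K", "m", "", "?")

# Look up each code case-insensitively; unknown or empty codes give NA.
mult <- unname(multipliers[toupper(dmg_exp)])
mult[is.na(mult)] <- 0   # invalid codes contribute no damage, as in the report

damage <- dmg * mult
print(damage)
```

Here 25 with code K becomes 25,000 and 2.5 with code m becomes 2,500,000, while the empty and `?` codes yield 0.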

The columns reporting harm to the population are FATALITIES and INJURIES.
We create a binary variable indicating whether an event caused any fatalities or injuries.

    # flag for population damage presence
    health_damage <- rep ( 0 , length= nrow(storm))  
    health_damage[ storm$FATALITIES>0 | storm$INJURIES >0 ] <- 1

    storm <- cbind ( storm  , health_damage )
    rm ( health_damage )

Now we proceed to format the event variable.
There are 985 different entries for the event type.
Many labels are misspelled or inconsistent. We process them in order to have a clean list of events.

  storm$event <- factor( tolower( storm$EVTYPE )) # uniformly lower all the letters
  storm <- storm[ !grepl( "^summary(.*| *.*)" , x= storm$event) , ] # remove summary events (grepl is safe even when nothing matches)
  storm$event <- factor( storm$event ) # remove unused levels



  # format levels
  storm$event <- gsub( pattern= "tstm(.*| *.*)" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "thunderstorm(.*| *.*)" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "thu(.*) wind(.*)" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "(.*)storm wind(.*)" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "tornado(.*| *.*)" , replacement= "tornado", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "tropical(.*| *.*)" , replacement= "tropical storm", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "aval(.*)" , replacement= "avalanche", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "blizzard(.*| *.*)" , replacement= "blizzard", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| *.*)blizzard(.*| *.*)" , replacement= "blizzard", x= storm$event , ignore.case= TRUE) 

  storm$event <- gsub( pattern= "winter storm(.*)" , replacement= "winter storm", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "winter we(.*)" , replacement= "winter weather", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "wild(.*| .*)" , replacement= "wildfire", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)waterspout(.*| .*)" , replacement= "waterspout", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "lig(.*)ing(.*| *.*)" , replacement= "lightning", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "lake(.*| *.*)effect(.*| *.*)" , replacement= "lake-effect snow", x= storm$event , ignore.case= TRUE) 
  storm$event <- gsub( pattern= "hurricane(.*| .*)" , replacement= "hurricane", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "high wind(.*| .*)" , replacement= "high wind", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "heavy snow(.*| .*)" , replacement= "heavy snow", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "heavy rain(.*| .*)" , replacement= "heavy rain", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "cold(.*| .*)" , replacement= "cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "drought(.*| .*)" , replacement= "drought", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "coast(.*| .*)flood(.*)" , replacement= "coastal flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)surf(.*| .*)" , replacement= "high surf", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "hail(.*| .*)" , replacement= "hail", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "heat(.*| .*)" , replacement= "heat", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "dust devil(.*| .*)" , replacement= "dust devil", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "dust storm(.*| .*)" , replacement= "dust storm", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "volcanic(.*| .*)" , replacement= "volcanic ash", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "urban(.*| .*)flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "river(.*| .*)flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "urban(.*| .*)small(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "rip(.*| .*)curr(.*| .*)" , replacement= "rip current", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "ext(.*| .*)cold(.*| .*)" , replacement= "extreme cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "ext(.*| .*)wind(.*| .*)" , replacement= "extreme cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "storm(.*| .*)surge(.*| .*)" , replacement= "storm surge/tide", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "storm(.*| .*)tide(.*| .*)" , replacement= "storm surge/tide", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)high(.*| .*)tide(.*| .*)" , replacement= "storm surge/tide", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "flash(.*| .*)flood(.*| .*)" , replacement= "flash flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "flood(.*| .*)flash(.*| .*)" , replacement= "flash flood", x= storm$event , ignore.case= TRUE)

  storm$event <- gsub( pattern= "frost( *)" , replacement= "frost/freeze", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "freeze( *)" , replacement= "frost/freeze", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "frost(.*| .*)" , replacement= "frost/freeze", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)frost(.*| .*)" , replacement= "frost/freeze", x= storm$event , ignore.case= TRUE)

  storm$event <- gsub( pattern= "lake(.*| .*)flood(.*| .*)" , replacement= "lakeshore flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)ice(.*| .*)storm(.*| .*)" , replacement= "ice storm", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)storm(.*| .*)ice(.*| .*)" , replacement= "ice storm", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "mud(.*| .*)slide(.*| .*)" , replacement= "mud slide", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)sleet(.*| .*)" , replacement= "sleet", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "hypothermia(.*| .*)" , replacement= "cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)surge(.*| .*)" , replacement= "storm surge/tide", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "strong wind(.*| .*)" , replacement= "strong wind", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "falling snow(.*| .*)" , replacement= "avalanche", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "low temp(.*| .*)" , replacement= "cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "coastal(.*| .*)" , replacement= "coastal flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "unseason(.*| .*)cold(.*| .*)" , replacement= "cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "unseason(.*| .*)warm(.*| .*)" , replacement= "heat", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "record(.*| .*)cold(.*| .*)" , replacement= "extreme cold/wind chill", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "record(.*| .*)heat(.*| .*)" , replacement= "excessive heat", x= storm$event , ignore.case= TRUE)
    storm$event <- gsub( pattern= "thundersnow" , replacement= "ice storm", x= storm$event , ignore.case= TRUE)
    storm$event <- gsub( pattern= "small(.*| .*)flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
    storm$event <- gsub( pattern= "hyperthermia(.*| .*)" , replacement= "heat", x= storm$event , ignore.case= TRUE)
    storm$event <- gsub( pattern= "warm(.*| .*)" , replacement= "heat", x= storm$event , ignore.case= TRUE)
    storm$event <- gsub( pattern= "small hail(.*| .*)" , replacement= "hail", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "major(.*| .*)flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "minor(.*| .*)flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "severe(.*| .*)thunder(.*| .*)" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "cstl flood(.*| .*)" , replacement= "coastal flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "rural flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "forest fire(.*| .*)" , replacement= "wildfire", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "tidal(.*| .*)" , replacement= "storm surge/tide", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "wind storm(.*| .*)" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "torn(.*| .*)" , replacement= "tornado", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)coastal flood(.*| .*)" , replacement= "coastal flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "snowmelt flood(.*| .*)" , replacement= "flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "^( *)thunderstorm" , replacement= "thunderstorm", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)heavy snow(.*| .*)" , replacement= "heavy snow", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "record snow(.*| .*)" , replacement= "heavy snow", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "^snow$" , replacement= "heavy snow", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)blowing snow(.*| .*)" , replacement= "heavy snow", x= storm$event , ignore.case= TRUE)
 
  storm$event <- gsub( pattern= "(.*| .*)hvy rain" , replacement= "heavy rain", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "^fog$" , replacement= "dense fog", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)typhoon(.*| .*)" , replacement= "hurricane", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)flash(.*| .*)" , replacement= "flash flood", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "( *)high(.*| .*)win(.*| .*)" , replacement= "high wind", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "(.*| .*)marine(.*| .*)high(.*| .*)win(.*| .*)" , replacement= "marine high wind", x= storm$event , ignore.case= TRUE)
  storm$event <- gsub( pattern= "^wind chill(.*| .*)" , replacement= "cold/wind chill", x= storm$event , ignore.case= TRUE)


  storm$event <- factor ( storm$event )  # remove unused levels

The event type now has 308 distinct entries.
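
To see how these substitutions collapse label variants, the sketch below applies three of the patterns from the cleaning step to a few illustrative raw labels. Only these three patterns are applied here, so the result is a simplified picture of the full pipeline.

```r
# A few raw labels of the kind found in EVTYPE (chosen for illustration).
raw <- c("TSTM WIND", "THUNDERSTORM WINDS", "MARINE TSTM WIND", "TORNADO F0")

ev <- tolower(raw)

# Same patterns as in the cleaning step: the match starts at the keyword
# and swallows everything after it, so any prefix is preserved.
ev <- gsub("tstm(.*| *.*)", "thunderstorm", ev)
ev <- gsub("thunderstorm(.*| *.*)", "thunderstorm", ev)
ev <- gsub("tornado(.*| *.*)", "tornado", ev)

print(ev)
```

Because the pattern is not anchored at the start of the string, "marine tstm wind" keeps its prefix and becomes "marine thunderstorm", which is a separate entry in the official event list.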

To reduce noise we drop the rows whose event is not listed
in the Storm Data Event Table on page 6 of the NATIONAL WEATHER SERVICE INSTRUCTION.
In addition we add a short name for each event, to be used in plots.

    valid_events <- c("astronomical low tide" , "avalanche" , "blizzard", "coastal flood",
                      "cold/wind chill", "dense fog", "dense smoke","drought",
                      "dust devil",  "dust storm", "excessive heat","extreme cold/wind chill",
                     "flash flood", "flood", "frost/freeze", "funnel cloud", 
                     "freezing fog",  "hail", "heat", "heavy rain",
                     "heavy snow", "high surf", "high wind", "hurricane",
                     "ice storm", "lake-effect snow",  "lakeshore flood", "lightning",
                     "marine hail", "marine high wind", "marine strong wind", "marine thunderstorm",
                     "rip current",  "seiche",  "sleet", "storm surge/tide",
                     "strong wind", "thunderstorm", "tornado",  "tropical storm",
                     "tsunami", "volcanic ash", "waterspout",  "wildfire",
                     "winter storm",  "winter weather" )
    valid_events <- data.frame ( valid_events )
    valid_events$id <- 1:nrow( valid_events )
    colnames ( valid_events )[1] <- "event"

    storm$event_code <- as.numeric ( storm$event )  # id of event    
    storm <- storm[ storm$event %in% valid_events$event, ]

    storm$event <- factor ( storm$event )  # clean levels
    storm$short_name <- lapply ( strsplit(as.vector(storm$event) , split=" ") ,  substr , start=1, stop=4 ) 
    storm$short_name <- unlist(lapply ( storm$short_name , paste , sep="", collapse=".") )
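
The short names are built by truncating each word of the event label to four characters and joining the pieces with a dot. A minimal sketch of the same idea on three event names, written with `vapply` rather than the two `lapply` calls above (a stylistic choice for the example, not a change to the analysis):

```r
events <- c("tornado", "excessive heat", "extreme cold/wind chill")

# Split on spaces, keep the first four characters of each word, join with ".".
words <- strsplit(events, split = " ")
short_name <- vapply(words,
                     function(w) paste(substr(w, 1, 4), collapse = "."),
                     character(1))

print(short_name)
```

Note that the split is on spaces only, so "cold/wind" is treated as one word and contributes just "cold".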

The preprocessing step is complete. We can now proceed to analyze the clean dataset.

Results

Analysis of the impact of weather events on population health

We convert the dataset to a dplyr table so that it can be processed with dplyr functions.

    storm <- tbl_df ( storm )

In order to assess the impact of the extreme events on population health we compute the number of
injuries and fatalities caused by each event type in the period since 1994.

    pop_health_events <- 
    storm %>%
      group_by (event , short_name ) %>%
      summarise(injuries= sum(INJURIES), fatalities= sum (FATALITIES), 
                events= n(), damage_event = sum(health_damage) ) %>%
      mutate ( tot_damage = injuries+fatalities , 
                perc_damage = damage_event/events  ) %>%
      mutate ( damage_per_event = tot_damage / events  ) %>%
      arrange ( -tot_damage  )

For each type of event we compute:
* injuries
* fatalities
* events: number of registered events
* damage_event: number of events that caused fatalities or injuries
* tot_damage: fatalities + injuries
* perc_damage: fraction of events that caused fatalities or injuries
* damage_per_event: average number of fatalities and injuries per event
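
As a quick check of these definitions, the derived columns can be recomputed by hand from the totals reported for tornadoes in the results table of this section. The snippet is only an arithmetic illustration and does not touch the dataset.

```r
# Totals for the "tornado" row as reported in this section.
injuries     <- 22589
fatalities   <- 1593
events       <- 25308
damage_event <- 2185

# Derived measures, following the definitions in the list above.
tot_damage       <- injuries + fatalities
perc_damage      <- damage_event / events
damage_per_event <- tot_damage / events

print(c(tot_damage       = tot_damage,
        perc_damage      = round(perc_damage, 2),
        damage_per_event = round(damage_per_event, 2)))
```

The rounded values (24182, 0.09 and 0.96) agree with the tornado row: about 9% of tornadoes caused harm, with slightly less than one casualty per event on average.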

    library ( xtable )
    xt <- xtable( pop_health_events )
    print ( xt , type= "html")

| | event | short_name | injuries | fatalities | events | damage_event | tot_damage | perc_damage | damage_per_event |
|---|---|---|---|---|---|---|---|---|---|
| 1 | tornado | torn | 22589.00 | 1593.00 | 25308 | 2185.00 | 24182.00 | 0.09 | 0.96 |
| 2 | excessive heat | exce.heat | 6575.00 | 1922.00 | 1759 | 681.00 | 8497.00 | 0.39 | 4.83 |
| 3 | flood | floo | 6782.00 | 459.00 | 25586 | 403.00 | 7241.00 | 0.02 | 0.28 |
| 4 | thunderstorm | thun | 5988.00 | 436.00 | 230026 | 2650.00 | 6424.00 | 0.01 | 0.03 |
| 5 | lightning | ligh | 5118.00 | 795.00 | 15326 | 3210.00 | 5913.00 | 0.21 | 0.39 |
| 6 | heat | heat | 2508.00 | 1152.00 | 987 | 243.00 | 3660.00 | 0.25 | 3.71 |
| 7 | flash flood | flas.floo | 1770.00 | 1007.00 | 54640 | 945.00 | 2777.00 | 0.02 | 0.05 |
| 8 | ice storm | ice.stor | 1989.00 | 87.00 | 1991 | 96.00 | 2076.00 | 0.05 | 1.04 |
| 9 | wildfire | wild | 1456.00 | 87.00 | 4231 | 331.00 | 1543.00 | 0.08 | 0.36 |
| 10 | high wind | high.wind | 1259.00 | 256.00 | 21307 | 589.00 | 1515.00 | 0.03 | 0.07 |
| 11 | winter storm | wint.stor | 1313.00 | 196.00 | 11406 | 222.00 | 1509.00 | 0.02 | 0.13 |
| 12 | hurricane | hurr | 1332.00 | 135.00 | 295 | 70.00 | 1467.00 | 0.24 | 4.97 |
| 13 | heavy snow | heav.snow | 1024.00 | 132.00 | 15801 | 195.00 | 1156.00 | 0.01 | 0.07 |
| 14 | dense fog | dens.fog | 1063.00 | 75.00 | 1825 | 121.00 | 1138.00 | 0.07 | 0.62 |
| 15 | rip current | rip.curr | 529.00 | 572.00 | 774 | 637.00 | 1101.00 | 0.82 | 1.42 |
| 16 | hail | hail | 953.00 | 10.00 | 222782 | 211.00 | 963.00 | 0.00 | 0.00 |
| 17 | winter weather | wint.weat | 538.00 | 61.00 | 8102 | 76.00 | 599.00 | 0.01 | 0.07 |
| 18 | extreme cold/wind chill | extr.cold.chil | 259.00 | 295.00 | 1939 | 220.00 | 554.00 | 0.11 | 0.29 |
| 19 | blizzard | bliz | 390.00 | 71.00 | 2692 | 72.00 | 461.00 | 0.03 | 0.17 |
| 20 | dust storm | dust.stor | 439.00 | 22.00 | 426 | 44.00 | 461.00 | 0.10 | 1.08 |
| 21 | tropical storm | trop.stor | 383.00 | 66.00 | 753 | 43.00 | 449.00 | 0.06 | 0.60 |
| 22 | high surf | high.surf | 246.00 | 165.00 | 1063 | 159.00 | 411.00 | 0.15 | 0.39 |
| 23 | strong wind | stro.wind | 301.00 | 110.00 | 3772 | 234.00 | 411.00 | 0.06 | 0.11 |
| 24 | avalanche | aval | 171.00 | 225.00 | 383 | 241.00 | 396.00 | 0.63 | 1.03 |
| 25 | heavy rain | heav.rain | 241.00 | 98.00 | 11778 | 120.00 | 339.00 | 0.01 | 0.03 |
| 26 | cold/wind chill | cold.chil | 60.00 | 168.00 | 701 | 125.00 | 228.00 | 0.18 | 0.33 |
| 27 | tsunami | tsun | 129.00 | 33.00 | 20 | 2.00 | 162.00 | 0.10 | 8.10 |
| 28 | storm surge/tide | stor.surg | 43.00 | 13.00 | 532 | 11.00 | 56.00 | 0.02 | 0.11 |
| 29 | marine thunderstorm | mari.thun | 34.00 | 19.00 | 11987 | 17.00 | 53.00 | 0.00 | 0.00 |
| 30 | dust devil | dust.devi | 43.00 | 2.00 | 145 | 15.00 | 45.00 | 0.10 | 0.31 |
| 31 | marine strong wind | mari.stro.wind | 22.00 | 14.00 | 48 | 16.00 | 36.00 | 0.33 | 0.75 |
| 32 | waterspout | wate | 33.00 | 2.00 | 3764 | 6.00 | 35.00 | 0.00 | 0.01 |
| 33 | coastal flood | coas.floo | 9.00 | 10.00 | 830 | 12.00 | 19.00 | 0.01 | 0.02 |
| 34 | drought | drou | 4.00 | 2.00 | 2491 | 2.00 | 6.00 | 0.00 | 0.00 |
| 35 | frost/freeze | fros | 3.00 | 1.00 | 1494 | 1.00 | 4.00 | 0.00 | 0.00 |
| 36 | funnel cloud | funn.clou | 3.00 | 0.00 | 6676 | 2.00 | 3.00 | 0.00 | 0.00 |
| 37 | marine high wind | mari.high.wind | 1.00 | 1.00 | 135 | 1.00 | 2.00 | 0.01 | 0.01 |
| 38 | sleet | slee | 0.00 | 2.00 | 114 | 1.00 | 2.00 | 0.01 | 0.02 |
| 39 | astronomical low tide | astr.low.tide | 0.00 | 0.00 | 174 | 0.00 | 0.00 | 0.00 | 0.00 |
| 40 | dense smoke | dens.smok | 0.00 | 0.00 | 10 | 0.00 | 0.00 | 0.00 | 0.00 |
| 41 | freezing fog | free.fog | 0.00 | 0.00 | 46 | 0.00 | 0.00 | 0.00 | 0.00 |
| 42 | lake-effect snow | lake.snow | 0.00 | 0.00 | 657 | 0.00 | 0.00 | 0.00 | 0.00 |
| 43 | lakeshore flood | lake.floo | 0.00 | 0.00 | 24 | 0.00 | 0.00 | 0.00 | 0.00 |
| 44 | marine hail | mari.hail | 0.00 | 0.00 | 442 | 0.00 | 0.00 | 0.00 | 0.00 |
| 45 | seiche | seic | 0.00 | 0.00 | 21 | 0.00 | 0.00 | 0.00 | 0.00 |
| 46 | volcanic ash | volc.ash | 0.00 | 0.00 | 29 | 0.00 | 0.00 | 0.00 | 0.00 |

The event type with the greatest impact on population health is the tornado,
which caused 24,182 casualties (22,589 injuries and 1,593 fatalities) over the period.

A different view of the impact compares the damage with the frequency of events.
In the following chart we can see that events cluster in different ways.
There are weak events that caused a lot of damage because of their frequency (e.g. thunderstorms),
and there are rare but very harmful events (such as tsunamis).

    ggplot( subset( pop_health_events , damage_event>=1 )  , 
            aes( x= log10(events) , y= log10(tot_damage) , size= damage_per_event  ) ) +
            geom_point(shape=21, colour="black", fill="cornsilk") +
      geom_text(aes( x= log10(events) , y= log10(tot_damage) , label=short_name), size=4  ) +
      labs( title="Chart 1\nImpact on population health\nper event (1994-2011)" ,  
            x="Event Frequency" , y="Total Damage" )+
      scale_size_area(max_size=20) +
      scale_y_continuous(breaks=0:4, labels=10^c(0:4))+
      scale_x_continuous(breaks=2:4, labels=10^c(2:4))

Chart 1: Each event type is placed according to the number of observations in the period
and the total damage, both on logarithmic scales. The size of the circle represents the average magnitude per event (fatalities + injuries).
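
The clusters visible in the chart can also be made explicit with two cutoffs. The sketch below is only an illustration: the five rows are copied from the table above, while the cutoffs `freq_cutoff` and `power_cutoff` are hypothetical values chosen for this example and are not part of the original analysis.

```r
# A few rows copied from the population health table
events_sample <- data.frame(
  event      = c("thunderstorm", "tornado", "excessive heat", "hurricane", "tsunami"),
  events     = c(230026, 25308, 1759, 295, 20),
  tot_damage = c(6424, 24182, 8497, 1467, 162)
)
events_sample$damage_per_event <- events_sample$tot_damage / events_sample$events

# Hypothetical cutoffs (assumptions made for this sketch only)
freq_cutoff  <- 10000  # at least this many events counts as "frequent"
power_cutoff <- 1      # at least one casualty per event counts as "powerful"

events_sample$frequency <- ifelse(events_sample$events >= freq_cutoff,
                                  "frequent", "rare")
events_sample$magnitude <- ifelse(events_sample$damage_per_event >= power_cutoff,
                                  "powerful", "weak")

print(events_sample[, c("event", "frequency", "magnitude")])
```

With these cutoffs thunderstorms and tornadoes fall in the frequent/weak group, while excessive heat, hurricanes and tsunamis fall in the rare/powerful group. Different cutoffs would move the borderline cases, such as tornadoes at about 0.96 casualties per event.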

Analysis of economic consequences of weather events

To assess the impact of extreme events on the economy we compute the damage to
properties and crops, in USD, caused by each event type in the period since 1994.

    econ_events <- 
    storm %>%
      mutate ( tot_damage= prop_damage + crop_damage )%>%
      group_by ( event , short_name ) %>%
      summarise (  events= n() , 
                   damage_events= sum(economic_damage) ,
                   tot_damage=sum( tot_damage ) ,  
                   prop_damage= sum( prop_damage) , 
                   crop_damage=sum( crop_damage ) ) %>%
      mutate ( perc_damage = damage_events/events , damage_per_event = tot_damage/ events )%>%
      arrange ( -tot_damage  )

For each type of event we compute:
* events: number of registered events
* damage_events: number of events that caused economic damage
* tot_damage: total damage in USD (prop_damage + crop_damage)
* prop_damage: damage to properties in USD
* crop_damage: damage to crops in USD
* perc_damage: fraction of events that caused damage (damage_events / events)
* damage_per_event: average damage per event in USD (tot_damage / events)
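
The same definitions can be checked on a handful of rows with base R. The sketch below copies the counts and the property and crop damage for three event types from the table that follows and recomputes the derived columns. The data frame `econ_sample` is introduced only for this illustration.

```r
# Counts and damages (USD) copied from the economic results table
econ_sample <- data.frame(
  event         = c("flood", "hurricane", "hail"),
  events        = c(25586, 295, 222782),
  damage_events = c(17539, 224, 93514),
  prop_damage   = c(144490820700, 85300910010, 15577316670),
  crop_damage   = c(5681385950, 5515617800, 3003632250)
)

# Derived columns, following the definitions in the list above
econ_sample$tot_damage       <- econ_sample$prop_damage + econ_sample$crop_damage
econ_sample$perc_damage      <- econ_sample$damage_events / econ_sample$events
econ_sample$damage_per_event <- econ_sample$tot_damage / econ_sample$events

print(format(econ_sample, big.mark = ",", nsmall = 2, scientific = FALSE))
```

The comparison shows the two regimes discussed in this report: hail has hundreds of thousands of events with a modest average damage, while a hurricane averages several hundred million dollars per event.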

Data are sorted by total damages.

    xt <- xtable( econ_events )
    print ( xt , type= "html")

| | event | short_name | events | damage_events | tot_damage | prop_damage | crop_damage | perc_damage | damage_per_event |
|---|---|---|---|---|---|---|---|---|---|
| 1 | flood | floo | 25586 | 17539.00 | 150172206650.00 | 144490820700.00 | 5681385950.00 | 0.69 | 5869311.60 |
| 2 | hurricane | hurr | 295 | 224.00 | 90816527810.00 | 85300910010.00 | 5515617800.00 | 0.76 | 307852636.64 |
| 3 | storm surge/tide | stor.surg | 532 | 323.00 | 47845017000.00 | 47844162000.00 | 855000.00 | 0.61 | 89934242.48 |
| 4 | tornado | torn | 25308 | 16900.00 | 25986972420.00 | 25625139910.00 | 361832510.00 | 0.67 | 1026828.37 |
| 5 | hail | hail | 222782 | 93514.00 | 18580948920.00 | 15577316670.00 | 3003632250.00 | 0.42 | 83404.18 |
| 6 | flash flood | flas.floo | 54640 | 32943.00 | 17990711610.00 | 16477220060.00 | 1513491550.00 | 0.60 | 329259.00 |
| 7 | drought | drou | 2491 | 1533.00 | 14968177780.00 | 1046106000.00 | 13922071780.00 | 0.62 | 6008903.16 |
| 8 | thunderstorm | thun | 230026 | 155792.00 | 11794391430.00 | 10531054330.00 | 1263337100.00 | 0.68 | 51274.17 |
| 9 | ice storm | ice.stor | 1991 | 1303.00 | 8855041310.00 | 3832927810.00 | 5022113500.00 | 0.65 | 4447534.56 |
| 10 | tropical storm | trop.stor | 753 | 631.00 | 8410013550.00 | 7715622550.00 | 694391000.00 | 0.84 | 11168676.69 |
| 11 | wildfire | wild | 4231 | 2324.00 | 8280845130.00 | 7877563500.00 | 403281630.00 | 0.55 | 1957183.91 |
| 12 | high wind | high.wind | 21307 | 14484.00 | 6084335610.00 | 5416038810.00 | 668296800.00 | 0.68 | 285555.71 |
| 13 | heavy rain | heav.rain | 11778 | 5848.00 | 4014950940.00 | 3224195140.00 | 790755800.00 | 0.50 | 340885.63 |
| 14 | frost/freeze | fros | 1494 | 1065.00 | 1904761000.00 | 18700000.00 | 1886061000.00 | 0.71 | 1274940.43 |
| 15 | winter storm | wint.stor | 11406 | 7495.00 | 1637041250.00 | 1620097250.00 | 16944000.00 | 0.66 | 143524.57 |
| 16 | extreme cold/wind chill | extr.cold.chil | 1939 | 931.00 | 1457263400.00 | 127240400.00 | 1330023000.00 | 0.48 | 751554.10 |
| 17 | lightning | ligh | 15326 | 10712.00 | 889238720.00 | 877251630.00 | 11987090.00 | 0.70 | 58021.58 |
| 18 | heavy snow | heav.snow | 15801 | 6979.00 | 886744740.00 | 757561640.00 | 129183100.00 | 0.44 | 56119.53 |
| 19 | blizzard | bliz | 2692 | 1839.00 | 539218950.00 | 532158950.00 | 7060000.00 | 0.68 | 200304.22 |
| 20 | excessive heat | exce.heat | 1759 | 727.00 | 500155700.00 | 7753700.00 | 492402000.00 | 0.41 | 284340.93 |
| 21 | coastal flood | coas.floo | 830 | 614.00 | 438876560.00 | 438820560.00 | 56000.00 | 0.74 | 528766.94 |
| 22 | heat | heat | 987 | 647.00 | 419528550.00 | 12457050.00 | 407071500.00 | 0.66 | 425054.26 |
| 23 | strong wind | stro.wind | 3772 | 3390.00 | 247572740.00 | 177619240.00 | 69953500.00 | 0.90 | 65634.34 |
| 24 | tsunami | tsun | 20 | 19.00 | 144082000.00 | 144062000.00 | 20000.00 | 0.95 | 7204100.00 |
| 25 | high surf | high.surf | 1063 | 463.00 | 101025000.00 | 101025000.00 | 0.00 | 0.44 | 95037.63 |
| 26 | cold/wind chill | cold.chil | 701 | 533.00 | 99286600.00 | 2544050.00 | 96742550.00 | 0.76 | 141635.66 |
| 27 | winter weather | wint.weat | 8102 | 6808.00 | 42298000.00 | 27298000.00 | 15000000.00 | 0.84 | 5220.69 |
| 28 | lake-effect snow | lake.snow | 657 | 618.00 | 40182000.00 | 40182000.00 | 0.00 | 0.94 | 61159.82 |
| 29 | dense fog | dens.fog | 1825 | 1012.00 | 22224500.00 | 22224500.00 | 0.00 | 0.55 | 12177.81 |
| 30 | dust storm | dust.stor | 426 | 238.00 | 9194000.00 | 5594000.00 | 3600000.00 | 0.56 | 21582.16 |
| 31 | waterspout | wate | 3764 | 1106.00 | 8414700.00 | 8414700.00 | 0.00 | 0.29 | 2235.57 |
| 32 | lakeshore flood | lake.floo | 24 | 24.00 | 7570000.00 | 7570000.00 | 0.00 | 1.00 | 315416.67 |
| 33 | marine thunderstorm | mari.thun | 11987 | 5918.00 | 5907400.00 | 5857400.00 | 50000.00 | 0.49 | 492.82 |
| 34 | avalanche | aval | 383 | 192.00 | 3716800.00 | 3716800.00 | 0.00 | 0.50 | 9704.44 |
| 35 | freezing fog | free.fog | 46 | 39.00 | 2182000.00 | 2182000.00 | 0.00 | 0.85 | 47434.78 |
| 36 | sleet | slee | 114 | 62.00 | 1500000.00 | 1500000.00 | 0.00 | 0.54 | 13157.89 |
| 37 | marine high wind | mari.high.wind | 135 | 135.00 | 1297010.00 | 1297010.00 | 0.00 | 1.00 | 9607.48 |
| 38 | seiche | seic | 21 | 12.00 | 980000.00 | 980000.00 | 0.00 | 0.57 | 46666.67 |
| 39 | dust devil | dust.devi | 145 | 104.00 | 712630.00 | 712630.00 | 0.00 | 0.72 | 4914.69 |
| 40 | volcanic ash | volc.ash | 29 | 5.00 | 500000.00 | 500000.00 | 0.00 | 0.17 | 17241.38 |
| 41 | marine strong wind | mari.stro.wind | 48 | 48.00 | 418330.00 | 418330.00 | 0.00 | 1.00 | 8715.21 |
| 42 | astronomical low tide | astr.low.tide | 174 | 174.00 | 320000.00 | 320000.00 | 0.00 | 1.00 | 1839.08 |
| 43 | rip current | rip.curr | 774 | 285.00 | 163000.00 | 163000.00 | 0.00 | 0.37 | 210.59 |
| 44 | funnel cloud | funn.clou | 6676 | 2424.00 | 144600.00 | 144600.00 | 0.00 | 0.36 | 21.66 |
| 45 | dense smoke | dens.smok | 10 | 9.00 | 100000.00 | 100000.00 | 0.00 | 0.90 | 10000.00 |
| 46 | marine hail | mari.hail | 442 | 209.00 | 4000.00 | 4000.00 | 0.00 | 0.47 | 9.05 |

Floods and hurricanes caused the greatest economic damage.
As with population health, events cluster by destructive power and frequency.
We focus on events with total damage of at least 1 million dollars.

    ggplot( subset(econ_events , tot_damage>=1e6)  , 
            aes( x= log10(events) , y= log10(tot_damage/1e6) , size= damage_per_event  ) ) +
            geom_point(shape=21, colour="black", fill="cornsilk") +
      geom_text(aes( x= log10(events) , y= log10(tot_damage/1e6) , label= short_name ), size=4  ) +
      labs( title="Chart 2\nEconomic consequences\nper event (1994-2011)" ,  x="Event Frequency" ,y="Total Damage (mln $)" )+
      scale_size_area(max_size=30) +
      scale_y_continuous(breaks=0:5, labels=10^c(0:5))+
      scale_x_continuous(breaks=1:5, labels=10^c(1:5))

Chart 2: Each event type is placed according to the number of observations in the period
and the total damage, both on logarithmic scales. The size of the circle represents the average magnitude per event (USD).

We can see there are events with high frequency but low impact per single event
(such as floods and hail) that have a high impact on the economy. We can also see that events like hurricanes or storm surge/tide have a high impact
due to their power.

Conclusions

The analysis of the impact of severe weather events on population health and the economy
shows different patterns among damaging events. Events with the greatest impact on population health can be divided into high frequency/low magnitude
(such as tornadoes, thunderstorms and floods) and medium frequency/high magnitude
(hurricanes, heat and ice storms).
Events with the greatest impact on the economy can be divided into high frequency/low magnitude
(such as tornadoes and floods) and low frequency/high magnitude
(hurricanes, storm surge/tide).

Hurricanes, floods and tornadoes are the most harmful events observed in the period 1994-2011.