Synopsis

This analysis shows the impacts over public health and economy of severe weather events happened in the USA during the period 1950-2011. The source of data is the NOAA Storm Database. The outcome of the analysis concerning the impact on public health definitely assigns to tornadoes the first cause of casualties, followed by heat. From an economic standpoint it appears that flood and rough waters events have the highest impact. From a policy perspective, apart from the countermeasures common to all types of events, the investments related to protection of people and properties from tornadoes, heat and water related events (i.e., flood, rough waters) should be addressed with higher priority with respect to others.

Data Processing

Getting the Data File

The NOAA Storm Database contains a record of the events concerning severe weather occurred in the USA. It can be consulted and downloaded in CVS format at http://www.ncdc.noaa.gov/stormevents/. It is kept updated by NOAA and currently contains events up to Oct. 2014.

The specific version used for this analysis contains events occurred in the USA from Jan. 1950 to Nov. 2011. It can be downloaded in a BZ2 compressed CSV file at the Coursera Reproducible Research link https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.

Downloading the Data File

#if file is not present in the current home, download it
datafile <- "repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(paste0("~/", datafile))) {
    url <- "http://d396qusza40orc.cloudfront.net/"
    download.file(paste0(url, datafile), paste0("~/", datafile))
}

Creating the Dataset

The dataset weather_events is created by reading the datafile. The bzfile connection is used, because the original file is BZ2 compressed.

weather_events <- read.csv(bzfile(paste0("~/", datafile)))

The structure of the dataset is as follows:

str(weather_events)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

For the purpose of this analysis, the variables to be considered are:

  • EVTYPE: describes the event type.
  • FATALITIES: number ot deaths in relation to the event.
  • INJURIES: number of injured persons in relation to the event.
  • PROPDMG: damage to properties in US dollars.
  • PROPDMGEXP: exponent to be applied to PROPDMG.
  • CROPDMG: damage to crop fields in US dollars.
  • CROPDMGEXP: exponent to be applied to CROPDMG (same concept of PROPDMGEXP).

The meaning of the exponent variables (PROPDMGEXP and CROPDMGEXP) is as in the following example: if PROPDMG=2.5 and PROPDMGEXP=“K” (for kilo, i.e, 1000), the amount of the damage resulted in 2.5 * 1000 = 2,500 USD.

Cleaning the dataset

The extraction of unique event types might help in understanding the correctness of data (for editorial reasons, only the first 120 types, out of 985, are listed here)

head(unique(weather_events$EVTYPE), n=60L)
##  [1] TORNADO                        TSTM WIND                     
##  [3] HAIL                           FREEZING RAIN                 
##  [5] SNOW                           ICE STORM/FLASH FLOOD         
##  [7] SNOW/ICE                       WINTER STORM                  
##  [9] HURRICANE OPAL/HIGH WINDS      THUNDERSTORM WINDS            
## [11] RECORD COLD                    HURRICANE ERIN                
## [13] HURRICANE OPAL                 HEAVY RAIN                    
## [15] LIGHTNING                      THUNDERSTORM WIND             
## [17] DENSE FOG                      RIP CURRENT                   
## [19] THUNDERSTORM WINS              FLASH FLOOD                   
## [21] FLASH FLOODING                 HIGH WINDS                    
## [23] FUNNEL CLOUD                   TORNADO F0                    
## [25] THUNDERSTORM WINDS LIGHTNING   THUNDERSTORM WINDS/HAIL       
## [27] HEAT                           WIND                          
## [29] LIGHTING                       HEAVY RAINS                   
## [31] LIGHTNING AND HEAVY RAIN       FUNNEL                        
## [33] WALL CLOUD                     FLOODING                      
## [35] THUNDERSTORM WINDS HAIL        FLOOD                         
## [37] COLD                           HEAVY RAIN/LIGHTNING          
## [39] FLASH FLOODING/THUNDERSTORM WI WALL CLOUD/FUNNEL CLOUD       
## [41] THUNDERSTORM                   WATERSPOUT                    
## [43] EXTREME COLD                   HAIL 1.75)                    
## [45] LIGHTNING/HEAVY RAIN           HIGH WIND                     
## [47] BLIZZARD                       BLIZZARD WEATHER              
## [49] WIND CHILL                     BREAKUP FLOODING              
## [51] HIGH WIND/BLIZZARD             RIVER FLOOD                   
## [53] HEAVY SNOW                     FREEZE                        
## [55] COASTAL FLOOD                  HIGH WIND AND HIGH TIDES      
## [57] HIGH WIND/BLIZZARD/FREEZING RA HIGH TIDES                    
## [59] HIGH WIND AND HEAVY SNOW       RECORD COLD AND HIGH WIND     
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

It’s quite evident that here we have a problem: the event types were not coded in a consistent way over the years. There are typos (e.g., “LIGHTING”), different text describing the same event (e.g., various “HURRICANE…”), not understandable events (e.g., “SUMMARY OF…”, not visible in the above list). Therefore, a cleaning of the EVTYPE variable is needed before proceeding with the analysis.

First, the event type will be made upper case. Then the text will be homogenized as much as possible.

#convert EVTYPE to upper case and transform it to a string
weather_events$EVTYPE <- toupper(as.character(weather_events$EVTYPE))

#homogenize text
weather_events$EVTYPE[grepl("AVALAN", weather_events$EVTYPE)] <- "AVALANCHE"
weather_events$EVTYPE[grepl("COLD", weather_events$EVTYPE)] <- "COLD"
weather_events$EVTYPE[grepl("DRY", weather_events$EVTYPE)] <- "DRY"
weather_events$EVTYPE[grepl("DUST|SMOKE|FUNNEL", weather_events$EVTYPE)] <- "DUST"
weather_events$EVTYPE[grepl("FIRE", weather_events$EVTYPE)] <- "WILDFIRE"
weather_events$EVTYPE[grepl("FLOOD", weather_events$EVTYPE)] <- "FLOOD"
weather_events$EVTYPE[grepl("FREEZ|FROST", weather_events$EVTYPE)] <- "FREEZE"
weather_events$EVTYPE[grepl("HAIL", weather_events$EVTYPE)] <- "HAIL"
weather_events$EVTYPE[grepl("HEAT", weather_events$EVTYPE)] <- "HEAT"
weather_events$EVTYPE[grepl("ICE", weather_events$EVTYPE)] <- "ICE"
weather_events$EVTYPE[grepl("LAND", weather_events$EVTYPE)] <- "LANDSLIDE"
weather_events$EVTYPE[grepl("LIGHT|LIGNT", weather_events$EVTYPE)] <- "LIGHTNING"
weather_events$EVTYPE[grepl("LOW TEMP", weather_events$EVTYPE)] <- "LOW TEMPERATURE"
weather_events$EVTYPE[grepl("MUD", weather_events$EVTYPE)] <- "MUD SLIDE"
weather_events$EVTYPE[grepl("RAIN|PRECIP", weather_events$EVTYPE)] <- "RAIN"
weather_events$EVTYPE[grepl("RIP|ROGUE|ROUGH|DROWN|SURF|SEA|WATER|WAVE|MARINE|TIDE", weather_events$EVTYPE)] <- "ROUGH WATERS"
weather_events$EVTYPE[grepl("HURRI", weather_events$EVTYPE)] <- "HURRICANE"
weather_events$EVTYPE[grepl("SNOW|BLIZZ", weather_events$EVTYPE)] <- "SNOW"
weather_events$EVTYPE[grepl("SPOUT", weather_events$EVTYPE)] <- "WATERSPOUT"
weather_events$EVTYPE[grepl("STORM", weather_events$EVTYPE)] <- "STORM"
weather_events$EVTYPE[grepl("SUMMARY|OTHER|[?]", weather_events$EVTYPE)] <- "UNKNOWN"
weather_events$EVTYPE[grepl("THUNDERSTORM|TSTM", weather_events$EVTYPE)] <- "THUNDERSTORM"
weather_events$EVTYPE[grepl("TORN", weather_events$EVTYPE)] <- "TORNADO"
weather_events$EVTYPE[grepl("UNSEAS|URBAN|RECORD|ABNOR", weather_events$EVTYPE)] <- "OTHER"
weather_events$EVTYPE[grepl("VOLC", weather_events$EVTYPE)] <- "VOLCANO"
weather_events$EVTYPE[grepl("WARM", weather_events$EVTYPE)] <- "WARM"
weather_events$EVTYPE[grepl("WIND|WND|GUST", weather_events$EVTYPE)] <- "WIND"
weather_events$EVTYPE[grepl("WINTER|WINTRY", weather_events$EVTYPE)] <- "WINTER WEATHER"

#transform back EVTYPE into a factor
weather_events$EVTYPE <- factor(weather_events$EVTYPE)

According to the NWS Directive 10-1605, para. 2.7, page 12, only three types of damage exponents should be present: K (1,000), M (1,000,000) and B (1,000,000,000). Unfortunately, by retrieving the unique values, the full list of exponents for property damages shows:

unique(weather_events$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

And for crop damages:

unique(weather_events$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Therefore, also for these variables we need a strategy to cope with unexpected content. Taking into consideration that we don’t have a clue concerning the meaning of those letters, the decision is made to ignore the records corresponding to them: the value corresponding to those events will be set to 0. Concerning the lowercase “m” and “k” present, they will be made uppercase and considered as “M” and “K”.

#create a subset of data for the economical impact analysis
econ_impact <- subset(weather_events, select=c(EVTYPE, PROPDMG, PROPDMGEXP,
    CROPDMG, CROPDMGEXP))
#add two variables to the dataset, which will hold the value of the damage
econ_impact$PROPVAL <- 0
econ_impact$CROPVAL <- 0
#transform the exponents into upper case
econ_impact$PROPDMGEXP <- toupper(as.character(econ_impact$PROPDMGEXP))
econ_impact$CROPDMGEXP <- toupper(as.character(econ_impact$CROPDMGEXP))

#cycle through the dataset
for (i in 1:nrow(econ_impact)) {

    #compute value for property damages
    if (econ_impact$PROPDMGEXP[i] == "K") {
        econ_impact$PROPVAL[i] <- econ_impact$PROPDMG[i] * 1000
    } else if (econ_impact$PROPDMGEXP[i] == "M") {
        econ_impact$PROPVAL[i] <- econ_impact$PROPDMG[i] * 1000000
    } else if (econ_impact$PROPDMGEXP[i] == "B") {
        econ_impact$PROPVAL[i] <- econ_impact$PROPDMG[i] * 1000000000
    } else {
        econ_impact$PROPVAL[i] <- 0
    }

    #compute value for crop field damages
    if (econ_impact$CROPDMGEXP[i] == "K") {
        econ_impact$CROPVAL[i] <- econ_impact$CROPDMG[i] * 1000
    } else if (econ_impact$CROPDMGEXP[i] == "M") {
        econ_impact$CROPVAL[i] <- econ_impact$CROPDMG[i] * 1000000
    } else if (econ_impact$CROPDMGEXP[i] == "B") {
        econ_impact$CROPVAL[i] <- econ_impact$CROPDMG[i] * 1000000000
    } else {
        econ_impact$CROPVAL[i] <- 0
    }

}

Results

The dataset is now ready for the analysis to be made. The impact on health is calculated by creating a dataset containing the aggregation of casualties by event. Then, the top 10 fatalities and injuries are extracted.

#create an aggregated dataset of casualties by event
CasualtiesByEvent <- aggregate(weather_events[c("FATALITIES", "INJURIES")],
                               by=weather_events["EVTYPE"],
                               FUN="sum")

#compute and show the top 10 fatalities
top10Fatalities <- head(CasualtiesByEvent[order(CasualtiesByEvent$FATALITIES, decreasing=TRUE),], 10)
top10Fatalities
##          EVTYPE FATALITIES INJURIES
## 69      TORNADO       5633    91364
## 26         HEAT       3138     9224
## 21        FLOOD       1525     8604
## 58 ROUGH WATERS        837      972
## 43    LIGHTNING        818     5234
## 68 THUNDERSTORM        505     6962
## 67        STORM        501     4227
## 7          COLD        451      320
## 85         WIND        442     1852
## 65         SNOW        244     1954
#compute and show the top 10 injuries
top10Injuries <- head(CasualtiesByEvent[order(CasualtiesByEvent$INJURIES, decreasing=TRUE),], 10)
top10Injuries
##          EVTYPE FATALITIES INJURIES
## 69      TORNADO       5633    91364
## 26         HEAT       3138     9224
## 21        FLOOD       1525     8604
## 68 THUNDERSTORM        505     6962
## 43    LIGHTNING        818     5234
## 67        STORM        501     4227
## 39          ICE        102     2164
## 65         SNOW        244     1954
## 85         WIND        442     1852
## 84     WILDFIRE         90     1608
#create a panel plot of the results
par(mfrow=c(2,1))
par(cex = 0.6)
barplot(top10Fatalities$FATALITIES,
        names=top10Fatalities$EVTYPE,
        axes=FALSE,
        main="Top 10 Casualties by Type of Event - USA 1950-2011\nSource: NOAA Storm Events Database",
        ylab="Fatalities",
        las=2)
axis(2, at=seq(0, 7000, 500))

barplot(top10Injuries$INJURIES,
        names=top10Injuries$EVTYPE,
        axes=FALSE,
        ylab="Injuries",
        las=2)
axis(2, at=seq(0, 100000, 15000))

It appears that tornadoes are, by far, the most dangerous type of event for human health, causing the highest number of both fatalities and injuries. They are followed by heat and flood, again in both fatalities and injuries. Interesting to be noted: whilst rough waters are the fourth cause of fatalities, that event doesn’t appear in the top 10 of the injuries: one could say that water doesn’t forgive.

The impact on economy is calculated by creating a dataset containing the aggregation of damages in US dollars by event. Then, the top 10 events causing damages to properties and to crop fields are extracted.

#create an aggregated dataset of damages to properties and to crop fields by event
DamagesByEvent <- aggregate(econ_impact[c("PROPVAL", "CROPVAL")],
                               by=econ_impact["EVTYPE"],
                               FUN="sum")

#compute and show the top 10 damages to properties
top10PropDamages <- head(DamagesByEvent[order(DamagesByEvent$PROPVAL, decreasing=TRUE),], 10)
top10PropDamages
##          EVTYPE      PROPVAL     CROPVAL
## 21        FLOOD 167529740320 12380109100
## 37    HURRICANE  84756180010  5515292800
## 67        STORM  64217927990  1380141300
## 69      TORNADO  56941932180   414961310
## 25         HAIL  17619990720  3114212850
## 84     WILDFIRE   8501628500   403281630
## 85         WIND   6070315700   767316950
## 58 ROUGH WATERS   5869280440 13973476000
## 68 THUNDERSTORM   4493691940   554007350
## 39          ICE   3968294360  5022114300
#compute and show the top 10 damages to crop fields
top10CropDamages <- head(DamagesByEvent[order(DamagesByEvent$CROPVAL, decreasing=TRUE),], 10)
top10CropDamages
##          EVTYPE      PROPVAL     CROPVAL
## 58 ROUGH WATERS   5869280440 13973476000
## 21        FLOOD 167529740320 12380109100
## 37    HURRICANE  84756180010  5515292800
## 39          ICE   3968294360  5022114300
## 25         HAIL  17619990720  3114212850
## 23       FREEZE     37854500  1997061000
## 7          COLD    246579450  1416765550
## 67        STORM  64217927990  1380141300
## 26         HEAT     20325750   904469280
## 53         RAIN   3234687690   806162800
#create a panel plot of the results
par(mfrow=c(2,1))
par(cex = 0.6)
barplot(top10PropDamages$PROPVAL,
        names=top10PropDamages$EVTYPE,
        axes=FALSE,
        main="Top 10 Damages by Type of Event - USA 1950-2011\nSource: NOAA Storm Events Database",
        ylab="Damages to Properties",
        las=2)
axis(2, at=seq(0, 200e9, 50e9))

barplot(top10CropDamages$CROPVAL,
        names=top10CropDamages$EVTYPE,
        axes=FALSE,
        ylab="Damages to Crop Fields",
        las=2)
axis(2, at=seq(0, 14e9, 1e9))

In this case floods, hurricanes, storms and tornadoes are the events causing most of the damages to properties. Crop fields are mostly impacted by rough waters and floods and, to a minor extent, by hurricanes and ice.