Synopsis

Storm data recorded by U.S. National Oceanic and Atmospheric Administration (NOAA) was explored in order to identify the most harmful storm events across the US. Observations between January 1993 and November 2011 were retained after processing the dataset. The total number of casualties (deaths and injuries) was calculated for each type of storm event as a measure of harm to population health. The total damage (in thousands of dollars) to property and crops was calculated for each type of storm event as a measure of the economic consequences of the events. The events causing the greatest harm to population health were found to be tornadoes and floods. The events with the largest economic consequences were found to be floods, hurricanes and tornadoes.

Data Processing

To load data from compressed file:

data <- read.csv("repdata-data-StormData.csv.bz2")

Size of data: 902298 rows (observations) and 37 columns (variables).

dim(data)
## [1] 902297     37

The first few rows of the data look like this:

head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

There are 48 event types defined in NWSI 10-1605 part 2.1.1.
There are 985 types of events recorded in the data frame.

length(unique(data$EVTYPE))
## [1] 985

For consistency, the event types must be edited and organised into the 48 types. Firstly, any events which caused no damage to population health or economic damage are irrelevant to this study, and so were removed. Next, any observations recording summary or monthly data are not events, and were also removed. Thereafter, unnecessary adjectives or whitespace in the names of the events were removed, allowing some events of the same type to be united.

#copy raw data frame into new data frame to be tidied
tidy <- data

#remove all observations that had no recorded health or economic damage
tidy <- tidy[tidy$FATALITIES > 0 | tidy$INJURIES > 0 | tidy$PROPDMG > 0 | tidy$CROPDMG > 0,]
#all types as lower case characters
tidy$EVTYPE <- tolower(tidy$EVTYPE)
#remove rows containing "summary" or "monthly"
tidy <- tidy[!grepl("summary",tidy$EVTYPE),]
tidy <- tidy[!grepl("monthly",tidy$EVTYPE),]
#combine categories into proper event types
tidy$EVTYPE <- gsub("urban","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unseasonal","",tidy$EVTYPE) 
tidy$EVTYPE <- gsub("unseasonably","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unseasonable","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unusually","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unusual","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("severe","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("small","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("seasonal","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("stream","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("and","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("river","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("record","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("prolonged","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("prolong","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("gusty","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("early","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("bitter","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("blowing","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("/"," ",tidy$EVTYPE)
tidy$EVTYPE <- gsub("damaging","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("hard","",tidy$EVTYPE)

tidy$EVTYPE <- trimws(tidy$EVTYPE)

This is still not enough. The number of types of events is still much larger than 48.

length(unique(tidy$EVTYPE))
## [1] 404

In order to unify more types of events it was necessary to do string searches to look for synonyms or spelling mistakes and rename them with an event name from the list of the 48 defined types. Whilst searching through the names of the events, some of them were found to be too ambiguous and were defined as “bad”and subsequently removed.

tidy$EVTYPE[grepl("winter we",tidy$EVTYPE)] <- "winter weather"
tidy$EVTYPE[grepl("wintry",tidy$EVTYPE)] <- "winter weather"
tidy$EVTYPE[grepl("winter s",tidy$EVTYPE)] <- "winter storm"
tidy$EVTYPE[grepl("fire",tidy$EVTYPE)] <- "wildfire"
tidy$EVTYPE[grepl("wet microburst",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("warm",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("spout",tidy$EVTYPE)] <- "waterspout"
tidy$EVTYPE[grepl("volcanic",tidy$EVTYPE)] <- "volcanic ash"
tidy$EVTYPE[grepl("typhoon",tidy$EVTYPE)] <- "hurricane(typhoon)"
tidy$EVTYPE[grepl("tropical storm",tidy$EVTYPE)] <- "tropical storm"
tidy$EVTYPE[grepl("^tstm",tidy$EVTYPE)] <- "thunderstorm wind"
tidy$EVTYPE[grepl("thundersnow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("^thu",tidy$EVTYPE)] <- "thunderstorm wind"
tidy$EVTYPE[grepl("^tunder",tidy$EVTYPE)] <- "thunderstorm wind"
tidy$EVTYPE[grepl("torn",tidy$EVTYPE)] <- "tornado"
tidy$EVTYPE[grepl("tidal",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("^strong wind",tidy$EVTYPE)] <- "strong wind"
tidy$EVTYPE[grepl("storm surge",tidy$EVTYPE)] <- "storm surge/tide"
tidy$EVTYPE[grepl("snowmelt",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("storm force",tidy$EVTYPE)] <- "strong wind"
tidy$EVTYPE[grepl("rogue",tidy$EVTYPE)] <- "storm surge/tide"
tidy$EVTYPE[grepl("rough",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("sleet",tidy$EVTYPE)] <- "sleet"
tidy$EVTYPE[grepl("^rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("^snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("heavy wet snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("heavy shower",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("rip",tidy$EVTYPE)] <- "rip current"
tidy$EVTYPE[grepl("rapid",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("microburst",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("lightning",tidy$EVTYPE)] <- "lightning"
tidy$EVTYPE[grepl("lighting",tidy$EVTYPE)] <- "lightning"
tidy$EVTYPE[grepl("ligntning",tidy$EVTYPE)] <- "lightning"
tidy$EVTYPE[grepl("lake effect",tidy$EVTYPE)] <- "lake-effect snow"
tidy$EVTYPE[grepl("hurricane",tidy$EVTYPE)] <- "hurricane(typhoon)"
tidy$EVTYPE[grepl("hyp",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("ice storm",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("icestorm",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("ice/snow",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("ice  snow",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("ice pellets",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("^heavy snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("high surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high seas",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high tides",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high water",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high waves",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high seas",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high tides",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high water",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high waves",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("hazardous surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high  swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("high wind",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("high  wind",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("heavy rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("hvy rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("lake flood",tidy$EVTYPE)] <- "lakeshore flood"
tidy$EVTYPE[grepl("flash",tidy$EVTYPE)] <- "flash"
tidy$EVTYPE[grepl("flood",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("flash",tidy$EVTYPE)] <- "flash flood"
tidy$EVTYPE[grepl("^ic",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("istorm",tidy$EVTYPE)] <- "ice storm"
tidy$EVTYPE[grepl("very warm",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("very dry",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("^hot",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("^heat",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("high temperature",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("fog",tidy$EVTYPE)] <- "f fog"
tidy$EVTYPE[grepl("frost",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("freez",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("fog",tidy$EVTYPE)] <- "freezing fog"

tidy$EVTYPE[grepl("funnel",tidy$EVTYPE)] <- "funnel cloud"
tidy$EVTYPE[grepl("extreme wind",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("extreme cold",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("extreme/ cold",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("extreme heat",tidy$EVTYPE)] <- "excessive heat"
tidy$EVTYPE[grepl("excessive rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("excessive prec",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("excessive snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("erosion",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("dam break",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("glaze",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("gust",tidy$EVTYPE)] <- "dust devil"
tidy$EVTYPE[grepl("heavy lake",tidy$EVTYPE)] <- "lake-effect snow"
tidy$EVTYPE[grepl("heavy precipitation",tidy$EVTYPE)] <- "heavy rain"

tidy$EVTYPE[grepl("hail",tidy$EVTYPE)] <- "hail"
tidy$EVTYPE[grepl("duststorm",tidy$EVTYPE)] <- "dust storm"
tidy$EVTYPE[grepl("dust dev",tidy$EVTYPE)] <- "dust devil"
tidy$EVTYPE[grepl("drought",tidy$EVTYPE)] <- "drought"
tidy$EVTYPE[grepl("excessively dry",tidy$EVTYPE)] <- "drought"
tidy$EVTYPE[grepl("^cold wind",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("^cold/",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("^cold wea",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("^cold temp",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("cold$",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("coas",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("bli",tidy$EVTYPE)] <- "blizzard"
tidy$EVTYPE[grepl("aval",tidy$EVTYPE)] <- "avalanche"
tidy$EVTYPE[grepl("black",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("down",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("dry mi",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("low",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("lsl",tidy$EVTYPE)] <- "debris flow"
tidy$EVTYPE[grepl("marine tstm wind",tidy$EVTYPE)] <- "marine thunderstorm wind"
tidy$EVTYPE[grepl("mud",tidy$EVTYPE)] <- "debris flow"
tidy$EVTYPE[grepl("rock",tidy$EVTYPE)] <- "debris flow"
tidy$EVTYPE[grepl("sml  fld",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("torrent",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("whirlwind",tidy$EVTYPE)] <- "tornado"
tidy$EVTYPE[grepl("^wind",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("breakup",tidy$EVTYPE)] <- "flood"

#remove observations that don't seem to fit any of the 48 categories or are not clear
bad <- c("","?","apache county", "astronomical high tide","cold  snow","cold  wet conditions","cold wave","cool  wet","drowning","dust","excessive wetness", "falling snow ice","gradient wind","heavy mix", "heavy seas","high","late season snow","light snow","light snowfall","marine accident","marine mishap","mixed precip","mixed precipitation","non- wind damage","non-tstm wind","non tstm wind","other","turbulence")
tidy <- tidy[!tidy$EVTYPE %in% bad,]

At last the event types are all included in the defined categories. However, not all the 48 categories appear in the final data set. This is presumably because a few of the categories were not reported to have caused damage, and so were removed at the beginning of the process.

u <- unique(tidy$EVTYPE)
u
##  [1] "tornado"                  "thunderstorm wind"       
##  [3] "hail"                     "ice storm"               
##  [5] "winter storm"             "hurricane(typhoon)"      
##  [7] "heavy rain"               "lightning"               
##  [9] "freezing fog"             "rip current"             
## [11] "flash flood"              "heat"                    
## [13] "high wind"                "cold/wind chill"         
## [15] "flood"                    "waterspout"              
## [17] "extreme cold/wind chill"  "frost/freeze"            
## [19] "avalanche"                "high surf"               
## [21] "heavy snow"               "dust storm"              
## [23] "sleet"                    "dust devil"              
## [25] "excessive heat"           "wildfire"                
## [27] "debris flow"              "funnel cloud"            
## [29] "strong wind"              "blizzard"                
## [31] "storm surge/tide"         "tropical storm"          
## [33] "winter weather"           "lake-effect snow"        
## [35] "coastal flood"            "seiche"                  
## [37] "volcanic ash"             "marine thunderstorm wind"
## [39] "tropical depression"      "tsunami"                 
## [41] "marine strong wind"       "dense smoke"

Now the variable EVTYPE need to be reclassed as a factor variable using the event types remaining after the above process as the levels.

tidy$EVTYPE <- as.factor(tidy$EVTYPE)

In the original dataset there are two variables relating to population health:FATALITIES and INJURIES. Both are recorded as number of incidents. In order to measure overall damage to population health a new variable “health” was created, as their sum.

#create variable health  -total number of fatalities and injuries
tidy$health <- tidy$FATALITIES + tidy$INJURIES

Economic damage is recorded in four variables in the original dataset, PROPDMG - property damage (numeric), PROPDMGEXP - units of property damage, CROPDMG - damage to crops (numeric), and CROPDMGEXP - units of damage to crops. In order to measure overall economic damage, the PROPDMG and CROPDMG values all need to be converted into values measured in thousands of dollars. The PROPDMGEXP and CROPDMGEXP variables with value k or K are in thousands of dollars, m or M are in millions and B are in billions. There is no information given as to the meaning of other values of these variables, but as the following tables show, there are relatively few occurences of other values, and so they were removed from the data set.

table(tidy$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
##  11534      1      0      5    210      0      1      1      4     18 
##      6      7      8      B      h      H      K      m      M 
##      3      3      0     40      1      6 231237      7  11315
table(tidy$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 152451      6     17      0      7     21  99901      1   1982
#process damages into thousands of dollars
units <- c("k","K","m","M","B")
tidy <- tidy[tidy$CROPDMGEXP %in% units & tidy$PROPDMGEXP %in% units,]

Now the economic damage can be converted to thousands of dollars and a new variable “econ” is introduced as the sum of PROPDMG and CROPDMG in thousands of dollars.

#multiply by a thousand if the value is in millions and by a million if the value is in billions (to make all values in thousands)
for (i in seq_along(tidy$PROPDMG)){
        if (tidy$PROPDMGEXP[i]=="m" | tidy$PROPDMGEXP[i]=="M"){
                tidy$PROPDMG[i]<-1000*tidy$PROPDMG[i]
        }
        else if (tidy$PROPDMGEXP[i]=="B"){
                tidy$PROPDMG[i]<-tidy$PROPDMG[i]*1000000
        }
}

for (i in seq_along(tidy$CROPDMG)){
        if (tidy$CROPDMGEXP[i]=="m" | tidy$CROPDMGEXP[i]=="M"){
                tidy$CROPDMG[i]<-1000*tidy$CROPDMG[i]
        }
        else if (tidy$CROPDMGEXP[i]=="B"){
                tidy$CROPDMG[i]<-tidy$CROPDMG[i]*1000000
        }
}
#create variable econ -total economic damage
tidy$econ <- tidy$CROPDMG + tidy$PROPDMG

Tidying up the dates variable:

library(lubridate)
## Warning: package 'lubridate' was built under R version 3.2.2
date <- as.character(tidy$BGN_DATE)
date <- strsplit(date," ")
date <- sapply(date, function(x){x[[1]][1]})
date <- mdy(date)
tidy$date <- date

The observations remaining in the processed dataset range from 1993-01-04 to 2011-11-30.

min(date)
## [1] "1993-01-04 UTC"
max(date)
## [1] "2011-11-30 UTC"

Results

Total number of casualties (fatalities plus injuries) by event type

The following figure shows the total number of deaths and injuries caused by each event type between 1993-01-04 and 2011-11-30.

totHealth <- with(tidy,tapply(health,EVTYPE,sum))
number <- totHealth
names(number)<-seq(42)
barplot(number,xlab="event",ylab="number",main="Total harm to population health by event type")

Event number 34 and event number 12 clearly stand out as the most harmful types of events.
These events and total numbers of casualties are:

totHealth[34]
## tornado 
##   13049
totHealth[12]
## flood 
##  6763

Other events causing less harm, are:

sort(totHealth,decreasing=TRUE)[3:10]
##  thunderstorm wind          ice storm               heat 
##               2018               1629               1379 
##          lightning        flash flood     excessive heat 
##               1183               1081               1070 
## hurricane(typhoon)           wildfire 
##               1019                642

Total economic damage by event type

The following figure show the total damage to property and crops caused by each event type (in thousands of dollars).

totEcon <- with(tidy,tapply(econ,EVTYPE,sum))
number <- totEcon
names(number)<-seq(42)
barplot(number,xlab="event",ylab="total damage",main="Total economic damage by event type")

Event type 12 clearly stands out as the event causing the greatest economic damage, followed by event 22, and then event 34. Events 16,11 and also 33,31 and 23 are non trivial . The events and the total damage in thousands of dollars are:

head(sort(totEcon,decreasing=TRUE),8)
##              flood hurricane(typhoon)            tornado 
##          148545617           44330001           18122666 
##               hail        flash flood          ice storm 
##           10021701            9223297            5925147 
##  thunderstorm wind   storm surge/tide 
##            5512876            4644413