Storm data recorded by U.S. National Oceanic and Atmospheric Administration (NOAA) was explored in order to identify the most harmful storm events across the US. Observations between January 1993 and November 2011 were retained after processing the dataset. The total number of casualties (deaths and injuries) was calculated for each type of storm event as a measure of harm to population health. The total damage (in thousands of dollars) to property and crops was calculated for each type of storm event as a measure of the economic consequences of the events. The events causing the greatest harm to population health were found to be tornadoes and floods. The events with the largest economic consequences were found to be floods, hurricanes and tornadoes.
To load data from compressed file:
data <- read.csv("repdata-data-StormData.csv.bz2")
Size of data: 902298 rows (observations) and 37 columns (variables).
dim(data)
## [1] 902297 37
The first few rows of the data look like this:
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
There are 48 event types defined in NWSI 10-1605 part 2.1.1.
There are 985 types of events recorded in the data frame.
length(unique(data$EVTYPE))
## [1] 985
For consistency, the event types must be edited and organised into the 48 types. Firstly, any events which caused no damage to population health or economic damage are irrelevant to this study, and so were removed. Next, any observations recording summary or monthly data are not events, and were also removed. Thereafter, unnecessary adjectives or whitespace in the names of the events were removed, allowing some events of the same type to be united.
#copy raw data frame into new data frame to be tidied
tidy <- data
#remove all observations that had no recorded health or economic damage
tidy <- tidy[tidy$FATALITIES > 0 | tidy$INJURIES > 0 | tidy$PROPDMG > 0 | tidy$CROPDMG > 0,]
#all types as lower case characters
tidy$EVTYPE <- tolower(tidy$EVTYPE)
#remove rows containing "summary" or "monthly"
tidy <- tidy[!grepl("summary",tidy$EVTYPE),]
tidy <- tidy[!grepl("monthly",tidy$EVTYPE),]
#combine categories into proper event types
tidy$EVTYPE <- gsub("urban","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unseasonal","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unseasonably","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unseasonable","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unusually","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("unusual","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("severe","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("small","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("seasonal","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("stream","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("and","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("river","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("record","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("prolonged","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("prolong","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("gusty","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("early","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("bitter","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("blowing","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("/"," ",tidy$EVTYPE)
tidy$EVTYPE <- gsub("damaging","",tidy$EVTYPE)
tidy$EVTYPE <- gsub("hard","",tidy$EVTYPE)
tidy$EVTYPE <- trimws(tidy$EVTYPE)
This is still not enough. The number of types of events is still much larger than 48.
length(unique(tidy$EVTYPE))
## [1] 404
In order to unify more types of events it was necessary to do string searches to look for synonyms or spelling mistakes and rename them with an event name from the list of the 48 defined types. Whilst searching through the names of the events, some of them were found to be too ambiguous and were defined as “bad”and subsequently removed.
tidy$EVTYPE[grepl("winter we",tidy$EVTYPE)] <- "winter weather"
tidy$EVTYPE[grepl("wintry",tidy$EVTYPE)] <- "winter weather"
tidy$EVTYPE[grepl("winter s",tidy$EVTYPE)] <- "winter storm"
tidy$EVTYPE[grepl("fire",tidy$EVTYPE)] <- "wildfire"
tidy$EVTYPE[grepl("wet microburst",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("warm",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("spout",tidy$EVTYPE)] <- "waterspout"
tidy$EVTYPE[grepl("volcanic",tidy$EVTYPE)] <- "volcanic ash"
tidy$EVTYPE[grepl("typhoon",tidy$EVTYPE)] <- "hurricane(typhoon)"
tidy$EVTYPE[grepl("tropical storm",tidy$EVTYPE)] <- "tropical storm"
tidy$EVTYPE[grepl("^tstm",tidy$EVTYPE)] <- "thunderstorm wind"
tidy$EVTYPE[grepl("thundersnow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("^thu",tidy$EVTYPE)] <- "thunderstorm wind"
tidy$EVTYPE[grepl("^tunder",tidy$EVTYPE)] <- "thunderstorm wind"
tidy$EVTYPE[grepl("torn",tidy$EVTYPE)] <- "tornado"
tidy$EVTYPE[grepl("tidal",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("^strong wind",tidy$EVTYPE)] <- "strong wind"
tidy$EVTYPE[grepl("storm surge",tidy$EVTYPE)] <- "storm surge/tide"
tidy$EVTYPE[grepl("snowmelt",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("storm force",tidy$EVTYPE)] <- "strong wind"
tidy$EVTYPE[grepl("rogue",tidy$EVTYPE)] <- "storm surge/tide"
tidy$EVTYPE[grepl("rough",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("sleet",tidy$EVTYPE)] <- "sleet"
tidy$EVTYPE[grepl("^rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("^snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("heavy wet snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("heavy shower",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("rip",tidy$EVTYPE)] <- "rip current"
tidy$EVTYPE[grepl("rapid",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("microburst",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("lightning",tidy$EVTYPE)] <- "lightning"
tidy$EVTYPE[grepl("lighting",tidy$EVTYPE)] <- "lightning"
tidy$EVTYPE[grepl("ligntning",tidy$EVTYPE)] <- "lightning"
tidy$EVTYPE[grepl("lake effect",tidy$EVTYPE)] <- "lake-effect snow"
tidy$EVTYPE[grepl("hurricane",tidy$EVTYPE)] <- "hurricane(typhoon)"
tidy$EVTYPE[grepl("hyp",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("ice storm",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("icestorm",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("ice/snow",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("ice snow",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("ice pellets",tidy$EVTYPE)] <- "istorm"
tidy$EVTYPE[grepl("^heavy snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("high surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high seas",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high tides",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high water",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high waves",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high seas",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high tides",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high water",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high waves",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("hazardous surf",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("high swells",tidy$EVTYPE)] <- "high surf"
tidy$EVTYPE[grepl("heavy rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("high wind",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("high wind",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("heavy rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("hvy rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("lake flood",tidy$EVTYPE)] <- "lakeshore flood"
tidy$EVTYPE[grepl("flash",tidy$EVTYPE)] <- "flash"
tidy$EVTYPE[grepl("flood",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("flash",tidy$EVTYPE)] <- "flash flood"
tidy$EVTYPE[grepl("^ic",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("istorm",tidy$EVTYPE)] <- "ice storm"
tidy$EVTYPE[grepl("very warm",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("very dry",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("^hot",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("^heat",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("high temperature",tidy$EVTYPE)] <- "heat"
tidy$EVTYPE[grepl("fog",tidy$EVTYPE)] <- "f fog"
tidy$EVTYPE[grepl("frost",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("freez",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("fog",tidy$EVTYPE)] <- "freezing fog"
tidy$EVTYPE[grepl("funnel",tidy$EVTYPE)] <- "funnel cloud"
tidy$EVTYPE[grepl("extreme wind",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("extreme cold",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("extreme/ cold",tidy$EVTYPE)] <- "extreme cold/wind chill"
tidy$EVTYPE[grepl("extreme heat",tidy$EVTYPE)] <- "excessive heat"
tidy$EVTYPE[grepl("excessive rain",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("excessive prec",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("excessive snow",tidy$EVTYPE)] <- "heavy snow"
tidy$EVTYPE[grepl("erosion",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("dam break",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("glaze",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("gust",tidy$EVTYPE)] <- "dust devil"
tidy$EVTYPE[grepl("heavy lake",tidy$EVTYPE)] <- "lake-effect snow"
tidy$EVTYPE[grepl("heavy precipitation",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("hail",tidy$EVTYPE)] <- "hail"
tidy$EVTYPE[grepl("duststorm",tidy$EVTYPE)] <- "dust storm"
tidy$EVTYPE[grepl("dust dev",tidy$EVTYPE)] <- "dust devil"
tidy$EVTYPE[grepl("drought",tidy$EVTYPE)] <- "drought"
tidy$EVTYPE[grepl("excessively dry",tidy$EVTYPE)] <- "drought"
tidy$EVTYPE[grepl("^cold wind",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("^cold/",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("^cold wea",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("^cold temp",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("cold$",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("coas",tidy$EVTYPE)] <- "coastal flood"
tidy$EVTYPE[grepl("bli",tidy$EVTYPE)] <- "blizzard"
tidy$EVTYPE[grepl("aval",tidy$EVTYPE)] <- "avalanche"
tidy$EVTYPE[grepl("black",tidy$EVTYPE)] <- "frost/freeze"
tidy$EVTYPE[grepl("down",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("dry mi",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("low",tidy$EVTYPE)] <- "cold/wind chill"
tidy$EVTYPE[grepl("lsl",tidy$EVTYPE)] <- "debris flow"
tidy$EVTYPE[grepl("marine tstm wind",tidy$EVTYPE)] <- "marine thunderstorm wind"
tidy$EVTYPE[grepl("mud",tidy$EVTYPE)] <- "debris flow"
tidy$EVTYPE[grepl("rock",tidy$EVTYPE)] <- "debris flow"
tidy$EVTYPE[grepl("sml fld",tidy$EVTYPE)] <- "flood"
tidy$EVTYPE[grepl("torrent",tidy$EVTYPE)] <- "heavy rain"
tidy$EVTYPE[grepl("whirlwind",tidy$EVTYPE)] <- "tornado"
tidy$EVTYPE[grepl("^wind",tidy$EVTYPE)] <- "high wind"
tidy$EVTYPE[grepl("breakup",tidy$EVTYPE)] <- "flood"
#remove observations that don't seem to fit any of the 48 categories or are not clear
bad <- c("","?","apache county", "astronomical high tide","cold snow","cold wet conditions","cold wave","cool wet","drowning","dust","excessive wetness", "falling snow ice","gradient wind","heavy mix", "heavy seas","high","late season snow","light snow","light snowfall","marine accident","marine mishap","mixed precip","mixed precipitation","non- wind damage","non-tstm wind","non tstm wind","other","turbulence")
tidy <- tidy[!tidy$EVTYPE %in% bad,]
At last the event types are all included in the defined categories. However, not all the 48 categories appear in the final data set. This is presumably because a few of the categories were not reported to have caused damage, and so were removed at the beginning of the process.
u <- unique(tidy$EVTYPE)
u
## [1] "tornado" "thunderstorm wind"
## [3] "hail" "ice storm"
## [5] "winter storm" "hurricane(typhoon)"
## [7] "heavy rain" "lightning"
## [9] "freezing fog" "rip current"
## [11] "flash flood" "heat"
## [13] "high wind" "cold/wind chill"
## [15] "flood" "waterspout"
## [17] "extreme cold/wind chill" "frost/freeze"
## [19] "avalanche" "high surf"
## [21] "heavy snow" "dust storm"
## [23] "sleet" "dust devil"
## [25] "excessive heat" "wildfire"
## [27] "debris flow" "funnel cloud"
## [29] "strong wind" "blizzard"
## [31] "storm surge/tide" "tropical storm"
## [33] "winter weather" "lake-effect snow"
## [35] "coastal flood" "seiche"
## [37] "volcanic ash" "marine thunderstorm wind"
## [39] "tropical depression" "tsunami"
## [41] "marine strong wind" "dense smoke"
Now the variable EVTYPE need to be reclassed as a factor variable using the event types remaining after the above process as the levels.
tidy$EVTYPE <- as.factor(tidy$EVTYPE)
In the original dataset there are two variables relating to population health:FATALITIES and INJURIES. Both are recorded as number of incidents. In order to measure overall damage to population health a new variable “health” was created, as their sum.
#create variable health -total number of fatalities and injuries
tidy$health <- tidy$FATALITIES + tidy$INJURIES
Economic damage is recorded in four variables in the original dataset, PROPDMG - property damage (numeric), PROPDMGEXP - units of property damage, CROPDMG - damage to crops (numeric), and CROPDMGEXP - units of damage to crops. In order to measure overall economic damage, the PROPDMG and CROPDMG values all need to be converted into values measured in thousands of dollars. The PROPDMGEXP and CROPDMGEXP variables with value k or K are in thousands of dollars, m or M are in millions and B are in billions. There is no information given as to the meaning of other values of these variables, but as the following tables show, there are relatively few occurences of other values, and so they were removed from the data set.
table(tidy$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 11534 1 0 5 210 0 1 1 4 18
## 6 7 8 B h H K m M
## 3 3 0 40 1 6 231237 7 11315
table(tidy$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 152451 6 17 0 7 21 99901 1 1982
#process damages into thousands of dollars
units <- c("k","K","m","M","B")
tidy <- tidy[tidy$CROPDMGEXP %in% units & tidy$PROPDMGEXP %in% units,]
Now the economic damage can be converted to thousands of dollars and a new variable “econ” is introduced as the sum of PROPDMG and CROPDMG in thousands of dollars.
#multiply by a thousand if the value is in millions and by a million if the value is in billions (to make all values in thousands)
for (i in seq_along(tidy$PROPDMG)){
if (tidy$PROPDMGEXP[i]=="m" | tidy$PROPDMGEXP[i]=="M"){
tidy$PROPDMG[i]<-1000*tidy$PROPDMG[i]
}
else if (tidy$PROPDMGEXP[i]=="B"){
tidy$PROPDMG[i]<-tidy$PROPDMG[i]*1000000
}
}
for (i in seq_along(tidy$CROPDMG)){
if (tidy$CROPDMGEXP[i]=="m" | tidy$CROPDMGEXP[i]=="M"){
tidy$CROPDMG[i]<-1000*tidy$CROPDMG[i]
}
else if (tidy$CROPDMGEXP[i]=="B"){
tidy$CROPDMG[i]<-tidy$CROPDMG[i]*1000000
}
}
#create variable econ -total economic damage
tidy$econ <- tidy$CROPDMG + tidy$PROPDMG
Tidying up the dates variable:
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.2.2
date <- as.character(tidy$BGN_DATE)
date <- strsplit(date," ")
date <- sapply(date, function(x){x[[1]][1]})
date <- mdy(date)
tidy$date <- date
The observations remaining in the processed dataset range from 1993-01-04 to 2011-11-30.
min(date)
## [1] "1993-01-04 UTC"
max(date)
## [1] "2011-11-30 UTC"
The following figure shows the total number of deaths and injuries caused by each event type between 1993-01-04 and 2011-11-30.
totHealth <- with(tidy,tapply(health,EVTYPE,sum))
number <- totHealth
names(number)<-seq(42)
barplot(number,xlab="event",ylab="number",main="Total harm to population health by event type")
Event number 34 and event number 12 clearly stand out as the most harmful types of events.
These events and total numbers of casualties are:
totHealth[34]
## tornado
## 13049
totHealth[12]
## flood
## 6763
Other events causing less harm, are:
sort(totHealth,decreasing=TRUE)[3:10]
## thunderstorm wind ice storm heat
## 2018 1629 1379
## lightning flash flood excessive heat
## 1183 1081 1070
## hurricane(typhoon) wildfire
## 1019 642
The following figure show the total damage to property and crops caused by each event type (in thousands of dollars).
totEcon <- with(tidy,tapply(econ,EVTYPE,sum))
number <- totEcon
names(number)<-seq(42)
barplot(number,xlab="event",ylab="total damage",main="Total economic damage by event type")
Event type 12 clearly stands out as the event causing the greatest economic damage, followed by event 22, and then event 34. Events 16,11 and also 33,31 and 23 are non trivial . The events and the total damage in thousands of dollars are:
head(sort(totEcon,decreasing=TRUE),8)
## flood hurricane(typhoon) tornado
## 148545617 44330001 18122666
## hail flash flood ice storm
## 10021701 9223297 5925147
## thunderstorm wind storm surge/tide
## 5512876 4644413