Storms and other severe weather events can have a significant impact on human lives by causing injury and death, and they can have an economic impact by destroying property, buildings, trees, farm animals, and other agricultural goods. In this report, we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of all reported storm events including the date, time, location, number of people injured, and number of people killed; it also contains estimates for the value of property and agriculture damaged by each weather event. Data are available from 1950 through 2011. The questions investigated in this report are: what types of storms have had the most significant impact on public health (the most human injuries and deaths); and what types of storms have had the most significant economic impact (the greatest monetary damage). The analysis shows that tornados and floods are both among the four most harmful types of storms for humans and for economic goods.
The data for this report come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The data can be downloaded using the following link:
Storm Data [47 Mb]
The following code downloads and saves the data to the user’s working directory, then loads the data into R. The dplyr package is required for the analysis.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="StormData.csv.bz2")
data<-read.csv("StormData.csv.bz2")
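Because the compressed file is large, it can help to skip the download when the file is already on disk. The following is a minimal sketch of that idea, not part of the original analysis; `fetch_if_missing` is a hypothetical helper name. The `stringsAsFactors = TRUE` argument in the commented-out call is an assumption: the `str()` output shown in this report lists Factor columns, which `read.csv()` produces by default only in R versions before 4.0.

```r
# Hypothetical helper: download a file only when it is not already present.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Uncomment to run; read.csv() reads .bz2 files directly.
# fetch_if_missing(storm_url, "StormData.csv.bz2")
# data <- read.csv("StormData.csv.bz2", stringsAsFactors = TRUE)
```

Rerunning the report then reuses the local copy instead of downloading the file again.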
NOAA also provides some documentation of the database. The documentation shows how some of the variables are constructed and/or defined. The user may choose to read these documents; however, reading them is not necessary for understanding the remainder of the analysis. All essential information is provided in this report.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years are considered to be more complete.
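One way to check how the number of recorded events changes over time is to count events per year from BGN_DATE. The sketch below is illustrative only and uses a small toy vector in place of `data$BGN_DATE`; the date strings are made up, but they follow the "month/day/year 0:00:00" format visible in the `str(data)` output.

```r
# Toy stand-in for data$BGN_DATE (made-up values in the database's date format).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "11/15/1951 0:00:00", "1/1/1966 0:00:00")

# Parse the date part; the trailing time component is ignored by the format.
years <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Number of events recorded in each year.
events_per_year <- table(years)
events_per_year
```

Applied to the full data set, the same table could be passed to `barplot()` to see the growth in recorded events.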
Once the data set is loaded into R, we can see the structure of the data using:
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The data set consists of 902297 observations. Each observation is one storm event. We use the terms “observation” and “event” interchangeably throughout this paper.
Since this is an analysis of types of storm events (coded in the variable EVTYPE), the names of the values within EVTYPE are cleaned up slightly by making them entirely lower case. This brings the number of unique values for EVTYPE down by about 9%, from 985 to 898.
data$EVTYPE<-tolower(data$EVTYPE)
For a more rigorous analysis, the values for EVTYPE require further modifications, e.g. “winter mix”, “wintery mix”, “wintry mix”, “winter weather mix”, and “winter weather/mix” should be considered equivalent. For the purpose of this report, these modifications were not made.
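As a rough illustration of what such a modification could look like, the sketch below normalizes whitespace and case and then maps the "winter mix" variants listed above onto a single label. The function name `clean_evtype` and the canonical label "wintry mix" are assumptions made for this example; this step was not applied in the report.

```r
# Hypothetical cleaning helper; not used in the analysis of this report.
clean_evtype <- function(ev) {
  ev <- tolower(trimws(as.character(ev)))   # drop outer spaces, lower case
  ev <- gsub("[[:space:]]+", " ", ev)       # collapse repeated whitespace
  # Map the spelling variants of "winter mix" onto one canonical label.
  sub("^wint(er|ery|ry)( weather)?[ /]mix$", "wintry mix", ev)
}

variants <- c("winter mix", "wintery mix", "wintry mix",
              "winter weather mix", "winter weather/mix",
              " HIGH SURF ADVISORY")
unique(clean_evtype(variants))
```

A full cleanup would need one such rule, or a lookup table, for every family of equivalent labels.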
First, we will investigate which types of storm events have had the greatest impact on humans. From str(data) and from the documentation files, we know that the variables FATALITIES and INJURIES represent the total number of reported human deaths and injuries respectively for each storm event.
We now create a new variable called totalHarmed which is the sum of the number of fatalities and of injuries for each event.
data$totalHarmed<-data$FATALITIES+data$INJURIES
Since we are interested in knowing the storm type (EVTYPE) that has caused the most harm to humans, we sum totalHarmed within each EVTYPE and then sort the event types by that total in descending order.
data<-arrange(data, EVTYPE)
EVTYPES<-unique(data$EVTYPE)
x<-tapply(data$totalHarmed, data$EVTYPE, sum, na.rm=TRUE)
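# take the labels from names(x) so that each label stays paired with the sum tapply() computed for it
x<-data.frame(EVTYPE=names(x), totalHarmed=as.numeric(x))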
x<-arrange(x, desc(totalHarmed))
From str(x) we see that there are 898 different types of storm events.
str(x)
## 'data.frame': 898 obs. of 2 variables:
## $ EVTYPE : Factor w/ 898 levels " high surf advisory",..: 758 116 779 154 418 243 138 387 685 888 ...
## $ totalHarmed: num 96979 8428 7461 7259 6046 ...
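The same per-type totals can also be computed with dplyr's `group_by()` and `summarise()`, which keep each label attached to its sum throughout. This is a sketch of that alternative on a small toy data frame (the rows are made up for illustration), not the code used to produce the results in this report.

```r
library(dplyr)

# Toy stand-in for the storm data (made-up rows).
toy <- data.frame(
  EVTYPE     = c("tornado", "flood", "tornado", "lightning", "flood"),
  FATALITIES = c(1, 0, 2, 1, 0),
  INJURIES   = c(15, 2, 6, 0, 3)
)

# Sum deaths and injuries within each event type, largest total first.
harm_by_type <- toy %>%
  group_by(EVTYPE) %>%
  summarise(totalHarmed = sum(FATALITIES + INJURIES, na.rm = TRUE)) %>%
  arrange(desc(totalHarmed))

head(harm_by_type, 6)
```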
While it may be possible to represent all 898 types of storm events graphically, we will plot only the top six most harmful event types for humans. We are choosing six somewhat arbitrarily; this yields a nice, readable plot while also clearly showing which storm types are the most harmful for humans.
xtop<-x[1:6,]
barplot(xtop$totalHarmed,
col=rainbow(6, s = 1, v = 1, start = .5, end = .75, alpha = 1),
xlab="Event Type",
ylab="Number of People Injured or Killed",
main="Storm Events with the Greatest Human Impact"
)
legend("topright",
tolower(xtop$EVTYPE),
cex=.6,
fill=rainbow(6, s = 1, v = 1, start = .5, end = .75, alpha = 1)
)
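As an alternative to a legend, the event types can be written directly under the bars with `names.arg`. The sketch below uses the six totals from the xtop table printed in this report; the margin and text-size settings are arbitrary choices for this example.

```r
# Totals copied from the xtop table in this report.
harm <- c(tornado = 96979, `excessive heat` = 8428, `tstm wind` = 7461,
          flood = 7259, lightning = 6046, heat = 3037)

op <- par(mar = c(8, 5, 4, 1))            # extra bottom margin for the labels
mids <- barplot(harm,
                names.arg = names(harm),  # label each bar with its event type
                las = 2,                  # draw the labels perpendicular to the axis
                cex.names = 0.8,
                col = rainbow(6, start = .5, end = .75),
                ylab = "Number of People Injured or Killed",
                main = "Storm Events with the Greatest Human Impact")
par(op)                                   # restore the previous margins
```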
Again, the top six most harmful storm event types for humans, in order of severity, are:
xtop$EVTYPE
## [1] tornado excessive heat tstm wind flood
## [5] lightning heat
## 898 Levels: high surf advisory coastal flood ... wnd
with the most injuries and deaths occurring as a result of:
xtop$EVTYPE[1]
## [1] tornado
## 898 Levels: high surf advisory coastal flood ... wnd
The next question we investigate is which storm event type has had the greatest economic impact. To analyze this, we consider PROPDMG and CROPDMG, which provide the value of damage to property and agriculture respectively, measured in US dollars. The variables PROPDMGEXP and CROPDMGEXP contain modifiers for the dollar values given in PROPDMG and CROPDMG; “K” represents thousands of dollars, “M” represents millions of dollars, and “B” represents billions of dollars. Some values within the modifier variables are comments or other symbols. These values are ignored, so events with any modifier other than “K”, “M”, or “B” contribute zero to the damage totals.
To analyze the total economic impact of each storm type, two new variables are created: totalProp which represents the total value of property damage measured in thousands of dollars and totalCrop which represents the total value of agricultural damage measured in thousands of dollars.
data$totalProp<-0
data$totalProp[which(data$PROPDMGEXP=="K")]<-data$PROPDMG[which(data$PROPDMGEXP=="K")]
data$totalProp[which(data$PROPDMGEXP=="M")]<-data$PROPDMG[which(data$PROPDMGEXP=="M")]*1000
data$totalProp[which(data$PROPDMGEXP=="B")]<-data$PROPDMG[which(data$PROPDMGEXP=="B")]*1000000
data$totalCrop<-0
data$totalCrop[which(data$CROPDMGEXP=="K")]<-data$CROPDMG[which(data$CROPDMGEXP=="K")]
data$totalCrop[which(data$CROPDMGEXP=="M")]<-data$CROPDMG[which(data$CROPDMGEXP=="M")]*1000
data$totalCrop[which(data$CROPDMGEXP=="B")]<-data$CROPDMG[which(data$CROPDMGEXP=="B")]*1000000
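The same conversion can be written more compactly with a named lookup vector of multipliers. This is a sketch, assuming the same rule as above (only "K", "M", and "B" count; everything else becomes zero); `to_thousands` is a hypothetical helper name.

```r
# Hypothetical helper: convert damage values to thousands of dollars.
to_thousands <- function(dmg, exp) {
  mult <- c(K = 1, M = 1e3, B = 1e6)     # multipliers relative to thousands
  m <- unname(mult[as.character(exp)])   # unrecognized modifiers give NA
  m[is.na(m)] <- 0                       # ignore them, as in the code above
  dmg * m
}

# Toy input: one value per kind of modifier, including blank and "?".
to_thousands(c(25, 2.5, 1, 3, 10), c("K", "M", "B", "", "?"))
```

With this helper, totalProp and totalCrop could each be computed in a single call, for example `to_thousands(data$PROPDMG, data$PROPDMGEXP)`.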
Next, the values of totalProp and totalCrop are combined to form the value of totalDamage. This variable represents the total dollar value of all damage resulting from each observation measured in thousands of dollars.
data$totalDamage<-data$totalProp+data$totalCrop
As previously stated, we are interested in knowing what storm event type (EVTYPE) has had the greatest economic impact. To determine this, the values of totalDamage are summed within each EVTYPE. The event types are then sorted in decreasing order (with most severe first).
y<-tapply(data$totalDamage, data$EVTYPE, sum, na.rm=TRUE)
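# take the labels from names(y) so that each label stays paired with the sum tapply() computed for it
y<-data.frame(EVTYPE=names(y), totalDamage=as.numeric(y))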
y<-arrange(y, desc(totalDamage))
Again, from str(y) we know that there are 898 different types of storm events.
str(y)
## 'data.frame': 898 obs. of 2 variables:
## $ EVTYPE : Factor w/ 898 levels " high surf advisory",..: 154 372 758 599 212 138 84 363 529 387 ...
## $ totalDamage: num 1.50e+08 7.19e+07 5.73e+07 4.33e+07 1.88e+07 ...
Just as when we considered the impact of storm types on humans, we will plot only the top six most damaging event types, even though it may be possible to graphically represent all event types. To reiterate, we are choosing six somewhat arbitrarily; this yields a nice, readable plot while also clearly showing which storm types cause the most economic damage.
ytop<-y[1:6,]
barplot(ytop$totalDamage,
col=rainbow(6, s = 1, v = 1, start = .9, end = .1, alpha = 1),
xlab="Event Type",
ylab="Total Damage (thousands of US dollars)",
main="Storm Events with the Greatest Economic Impact"
)
legend("topright",
tolower(ytop$EVTYPE),
cex=.6,
fill=rainbow(6, s = 1, v = 1, start = .9, end = .1, alpha = 1)
)
The plot shows that the top six most harmful storm event types monetarily, in order of severity, are:
ytop$EVTYPE
## [1] flood hurricane/typhoon tornado storm surge
## [5] hail flash flood
## 898 Levels: high surf advisory coastal flood ... wnd
with the most severe economic impact occurring as a result of:
ytop$EVTYPE[1]
## [1] flood
## 898 Levels: high surf advisory coastal flood ... wnd
The analysis showed that the storm types with the most significant negative impact on humans in the United States were:
xtop
## EVTYPE totalHarmed
## 1 tornado 96979
## 2 excessive heat 8428
## 3 tstm wind 7461
## 4 flood 7259
## 5 lightning 6046
## 6 heat 3037
and that the storm types causing the greatest economic damage were:
ytop
## EVTYPE totalDamage
## 1 flood 150319678
## 2 hurricane/typhoon 71913713
## 3 tornado 57340614
## 4 storm surge 43323541
## 5 hail 18752904
## 6 flash flood 17562129
We can see that there is some overlap between these two subsets of storm event types. Tornados and floods are both among the top four most harmful storm events for human lives and for property/agricultural goods. These results also show that tornados have killed and injured more than ten times as many people as the next most significant storm event type, and that floods have caused more than twice as much monetary damage as the next most significant type of storm.
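The overlap and the two ratios quoted above can be checked directly from the totals in the xtop and ytop tables. The short sketch below does this for the top four entries of each table.

```r
# Top four totals copied from the xtop and ytop tables in this report.
harm <- c(tornado = 96979, `excessive heat` = 8428,
          `tstm wind` = 7461, flood = 7259)
damage <- c(flood = 150319678, `hurricane/typhoon` = 71913713,
            tornado = 57340614, `storm surge` = 43323541)

# Event types that appear in both top-four lists.
shared <- intersect(names(harm), names(damage))

# Ratio of the first-ranked total to the second-ranked total.
harm_ratio   <- harm[["tornado"]] / harm[["excessive heat"]]       # roughly 11.5
damage_ratio <- damage[["flood"]] / damage[["hurricane/typhoon"]]  # roughly 2.09
```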
This paper does not seek to make any suggestions as far as preventing loss of lives or monetary damage due to various types of storm events. We merely show which weather incidents have inflicted the most devastation in the United States between 1950 and 2011.
While the results appear reasonable, namely that tornados and floods are among the most destructive, the actual counts for humans injured/killed and the dollar value of damage to property/crops should not be taken as precise measurements. This is because event types were not recorded consistently within the NOAA database. EVTYPE has 985 unique values in the raw data set. Modifying these values to be entirely lower case brought the number of unique values to 898, a decrease of about 9%. However, several of the values clearly represent the same type of event (e.g. “urban and small stream”, “urban and small stream flood”, “urban and small stream floodin”, “urban/small stream flood”, “urban/sml stream fld”, etc.). Neither the data cleaning nor the analysis itself took the issue of spelling variations into account. Adjusting for spelling variation and further cleaning up the EVTYPE values could have a significant impact on the values for totalDamage and totalHarmed, potentially increasing them notably. We assume that merging like EVTYPE values would not change the ranking of severity of storm event types, but only the counts for the total number of people harmed and for the total economic cost of the damage.