Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The questions this report attempts to answer are as follows: 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
As we will see in the Results section, in the United States we find that tornadoes have the most drastic affect on population health in terms of fatalities and injuries, while hurricanes have the most devastating economic affect in terms of property and crop damage.
To begin our review, we must first download the data set and load required packages into R:
if(!file.exists('StormData.csv.bz2')){
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile='StormData.csv.bz2')}
if(file.exists('StormData.csv.bz2')){
stormorig <- read.csv(bzfile('StormData.csv.bz2'), header = TRUE)}
library(ggplot2)
In addition to downloading the data set, it is also important to review the supplementary materials provided in the National Weather Service Instruction 10-1605 document (found under https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf as of the current date). While we do not require the document to perform the data analysis, explanatory sections of this report will refer to information found in that document.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. Specifically, from 1950 through 1954 only tornado events were recorded. From 1955 through 1995, only tornado, thunderstorm wind, and hail events either were keyed from the paper publications into digital data or extracted from the unformatted text files. It is only the data from 1996 onwards that record all 48 event types as defined in NWS Directive 10-1605. Because of this, our review only takes into consideration the data from 1996 onwards.
str(stormorig)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
##Reformat date column
stormorig$BGN_DATE<-as.Date(stormorig$BGN_DATE,format="%m/%d/%Y %H:%M:%S")
stormorig$YEAR<-as.numeric(format(stormorig$BGN_DATE,"%Y"))
##Remove data prior to 1996
stormfiltered <- stormorig[stormorig$YEAR >= 1996, ]
To answer the question of which event type most affects population health, we first subset the data to include only those events that have either a fatality or injury associated with that type of event. Note that the description of the severity of the various injuries (in terms of dollars required to recover) is not provided in the data set. While the case can be made that a fatality is more serious than an injury, or that perhaps an injury could be more financially costly than a fatality, for the purposes of this report we rate fatalities and injuries as equal.
##Subset to retain only those columns necessary or relevant
stormppl<-stormfiltered[,c("BGN_DATE","STATE","EVTYPE","FATALITIES","INJURIES")]
##Create new column to combine fatality+injury data for easier subsetting
stormppl$Combined<-stormppl$FATALITIES+stormppl$INJURIES
##Remove all rows for which combined column = 0
stormppl<- stormppl[stormppl$Combined > 0, ]
Per NWS Directive 10-1605, only 48 main EVTYPE categories are valid. Per review of the newly subsetted data, there are instances of the same categories both capitalized and lower case, various misspellings, and abbreviations that can be cleaned and combined in order to retain the data, rather than just removing all rows with a strictly invalid EVTYPE. We have also chosen to combine some valid distinct EVTYPES into logical groups, for instance the various types of floods.
##Only 48 valid EVTYPES. Combine EVTYPES into logical groups
stormppl$EVTYPE <- gsub(".\\/", " ", stormppl$EVTYPE)
stormppl$EVTYPE <- gsub(".*FLOOD.*|.*FLD.*", "FLOOD", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*TSTM.*|.*THUNDERSTORM.*", "THUNDERSTORM", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*WIND.*", "WIND", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*FREEZ.*|.*GLAZE.*|.*ICY.*|.*ICE.*", "ICE", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*WINTER.*|.*FROST.*|.*CHILL.*|.*WINTRY.*|.*COLD.*|.*EXPOSURE.*",
"COLD", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*SNOW.*|.*BLIZZARD.*", "SNOW", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*HEAT.*|.*WARM.*", "HEAT", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*RIP CURRENT.*", "RIP CURRENT", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*FIRE.*", "FIRE", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*SURF.*", "SURF", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*COASTAL.*", "COASTAL STORM", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*HURRICAN.*|.*TYPHOON.*", "HURRICANE", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*SEAS.*|.*SWELL.*|.*WAVE.*|.*WATER.*|.*SURG.*|.*SURF.*", "STORM SURGE",
stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*LAND.*|.*MUD.*", "LANDSLIDE", stormppl$EVTYPE, ignore.case=TRUE)
stormppl$EVTYPE <- gsub(".*DUST.*", "DUST", stormppl$EVTYPE, ignore.case=TRUE)
unique(stormppl$EVTYPE)
## [1] "TORNADO" "FLOOD" "COLD"
## [4] "THUNDERSTORM" "LIGHTNING" "HEAT"
## [7] "RIP CURRENT" "SNOW" "ICE"
## [10] "DUST" "FOG" "STORM SURGE"
## [13] "FIRE" "WIND" "HEAVY RAIN"
## [16] "Marine Accident" "AVALANCHE" "DRY MICROBURST"
## [19] "HAIL" "COASTAL STORM" "HURRICANE"
## [22] "TROPICAL STORM" "Torrential Rainfall" "MIXED PRECIP"
## [25] "LANDSLIDE" "SMALL HAIL" "FUNNEL CLOUD"
## [28] "OTHER" "DROWNING" "DENSE FOG"
## [31] "DROUGHT" "TSUNAMI"
We have not made the attempt to match the given EVTYPES exactly, but have instead chosen to form logical event type groupings. We now have an imperfect but useable data set for which we can determine the impact of weather event type on population health.
Similarly, in order to determine which weather event type has the greatest economic consequences, we first subset the data to include only those events that have a financial impact, as denoted in the data set as either property or crop damage. NWS Directive 10-1605 indicates that valid values in the damage exponent category are limited to K for thousand, M for million, and B for billion. Rather than attempt to guess at what the other values in these columns may or may not indicate, we have removed all invalid exponent values. Note that our processing changes all values into thousands.
##Subset for required columns
stormdmg<-stormfiltered[,c("BGN_DATE","STATE","EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
##Transform k/K, m/M, & b/B into numerical data - all other values are incorrectly used
##Final values will be in thousands.
KMB <- function(x){
if(x=="k") {return("1")}
else if(x=="K"){return("1")}
else if(x=="m"){return("1000")}
else if(x=="M"){return("1000")}
else if(x=="B"){return("1000000")}
else(return("0"))
}
options(scipen=999)
##Creates a new column (column 8) that uses the for-loop to populate damage amounts
stormdmg$PROPDMGamt<-as.numeric(sapply(stormdmg$PROPDMGEXP,KMB))
##Does the same (column 9) for crop damage
stormdmg$CROPDMGamt<-as.numeric(sapply(stormdmg$CROPDMGEXP,KMB))
##Multiplies our values and exponents
stormdmg$PROPDMGamt<-stormdmg$PROPDMG*stormdmg$PROPDMGamt
stormdmg$CROPDMGamt<-stormdmg$CROPDMG*stormdmg$CROPDMGamt
Because the dollar values in the data set are estimates of damage, and because of potential human error, we will now look for outliers in our data. It is presumed that Hurricane Katrina will have the largest economic impact as it is potentially the most devastating weather event in the U.S. in modern history.
##Look for outliers / miskeyed data
summary(stormdmg$PROPDMGamt) ##huge max value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 561 1 115000000
stormdmg[which.max(stormdmg$PROPDMGamt),]
## BGN_DATE STATE EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 2006-01-01 CA FLOOD 115 B 32.5 M
## PROPDMGamt CROPDMGamt
## 605953 115000000 32500
We indeed have an unusual outlier in our data - a flood event in California in 2006 is recorded in the data set as causing $115B in property damage. We assume this data is either miskeyed or miscoded, since an event of this economic impact would have been 3 1/2 times more devastating than Hurricane Katrina, and this event was not of that magnitude. Removing the value for this line, we now get Hurricane Katrina as the number one weather event in terms of both property and crop damage.
##Assuming data is miscoded, remove this value.
which.max(stormdmg$PROPDMGamt)
## [1] 357186
stormdmg[357186,8]=0
##Run again, Hurricane Katrina now max as expected
stormdmg[which.max(stormdmg$PROPDMGamt),]
## BGN_DATE STATE EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 577676 2005-08-29 LA STORM SURGE 31.3 B 0
## PROPDMGamt CROPDMGamt
## 577676 31300000 0
##Also max for crop damage
summary(stormdmg$CROPDMGamt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 53.2 0.0 1510000.0
stormdmg[which.max(stormdmg$CROPDMGamt),]
## BGN_DATE STATE EVTYPE PROPDMG PROPDMGEXP CROPDMG
## 581537 2005-08-29 MS HURRICANE/TYPHOON 5.88 B 1.51
## CROPDMGEXP PROPDMGamt CROPDMGamt
## 581537 B 5880000 1510000
Once again we will remove all rows for which there was no property or crop damage recorded in the data set.
##Add columns to get total for both property and crop damage.
stormdmg$totdmg<-stormdmg$PROPDMGamt+stormdmg$CROPDMGamt
##Remove all columns for which prop+crop dmg = 0
stormdmgtotals <- stormdmg[stormdmg$totdmg > 0, ]
Again, per NWS Directive 10-1605, only 48 main EVTYPE categories are valid. Per review of the newly subsetted data, there are instances of the same categories both capitalized and lower case, various misspellings, and abbreviations that can be cleaned and combined in order to retain the data, rather than just removing all rows with a strictly invalid EVTYPE. We have also chosen to combine some valid distinct EVTYPES into logical groups.
##186 unique EVTYPES, only 48 are valid. Combine EVTYPES into logical groups
stormdmgtotals$EVTYPE <- gsub(".\\/", " ", stormdmgtotals$EVTYPE)
stormdmgtotals$EVTYPE <- gsub(".*FLOOD.*|.*FLD.*", "FLOOD", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*TSTM.*|.*THUNDERSTORM.*", "THUNDERSTORM", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*WIND.*", "WIND", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*FREEZ.*|.*GLAZE.*|.*ICY.*|.*ICE.*", "ICE", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*WINTER.*|.*FROST.*|.*CHILL.*|.*WINTRY.*|.*COLD.*|.*EXPOSURE.*",
"COLD", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*SNOW.*|.*BLIZZARD.*", "SNOW", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*HEAT.*|.*WARM.*", "HEAT", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*RIP CURRENT.*", "RIP CURRENT", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*FIRE.*", "FIRE", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*SURF.*", "SURF", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*COASTAL.*", "COASTAL STORM", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*HURRICAN.*|.*TYPHOON.*", "HURRICANE", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*SEAS.*|.*SWELL.*|.*WAVE.*|.*WATER.*|.*SURG.*|.*SURF.*", "STORM SURGE",
stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*LAND.*|.*MUD.*", "LANDSLIDE", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*DUST.*", "DUST", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*HAIL.*", "HAIL", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*BURST.*", "MICROBURST", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*RAIN.*", "RAIN", stormdmgtotals$EVTYPE, ignore.case=TRUE)
stormdmgtotals$EVTYPE <- gsub(".*FOG.*|.*SMOKE.*", "FOG", stormdmgtotals$EVTYPE, ignore.case=TRUE)
unique(stormdmgtotals$EVTYPE)
## [1] "COLD" "TORNADO"
## [3] "THUNDERSTORM" "WIND"
## [5] "FLOOD" "ICE"
## [7] "LIGHTNING" "HAIL"
## [9] "Other" "SNOW"
## [11] "FIRE" "STORM SURGE"
## [13] "DUST" "RAIN"
## [15] "Marine Accident" "MICROBURST"
## [17] "FOG" "TROPICAL STORM"
## [19] "Beach Erosion" "DROUGHT"
## [21] "LANDSLIDE" "COASTAL STORM"
## [23] "HURRICANE" "HEAT"
## [25] "OTHER" "Mixed Precipitation"
## [27] "DAM BREAK" "FUNNEL CLOUD"
## [29] "AVALANCHE" "SEICHE"
## [31] "ROCK SLIDE" "MIXED PRECIPITATION"
## [33] "RIP CURRENT" "VOLCANIC ASH"
## [35] "ASTRONOMICAL HIGH TIDE" "TROPICAL DEPRESSION"
## [37] "TSUNAMI" "ASTRONOMICAL LOW TIDE"
We have not made the attempt to match the given EVTYPES exactly, but have instead chosen to form logical event type groupings. We now have a second imperfect but useable data set for which we can determine the economic consequences of various weather events.
After aggregating the combined fatality and injury information by weather event type and taking the top 20 values in order to narrow our focus and produce a less cluttered graph, we can see that tornadoes by far have the greatest impact to human health in the United States in terms of fatality and injury combined.
##Aggregate data by EVTYPE
pophealth<-aggregate(Combined~EVTYPE,stormppl,sum)
pophealthnew<-pophealth[order(pophealth$Combined, decreasing = TRUE),]
pophealthnew <- pophealthnew[1:19,]
##Create a plot showing the top EVTYPES with regards to population health
par(mar=c(4,4,1,1))
g<-ggplot(pophealthnew,aes(EVTYPE,Combined, fill=EVTYPE))+geom_bar(stat="identity")
g<-g+theme(axis.text.x = element_text(angle=40,hjust=1))
g<-g+xlab("Event Type")+ylab("Fatalities+Injuries")+ggtitle("Impact on Population Health by Event Type, Top 20")
g
After aggregating the combined property and crop damage (in thousands) by weather event type and taking the top 20 values in order to narrow our focus and produce a less cluttered graph, we can see that hurricanes by far have the greatest economic impact in terms of property and crop damage in the United States.
##Aggregate data by EVTYPE
econdmg<-aggregate(totdmg~EVTYPE,stormdmgtotals,sum)
econdmgnew<-econdmg[order(econdmg$totdmg, decreasing = TRUE),]
econdmgnew <- econdmgnew[1:19,]
##Create a plot showing the top EVTYPES with regards to economic impact
par(mar=c(4,4,1,1))
g<-ggplot(econdmgnew,aes(EVTYPE,totdmg, fill=EVTYPE))+geom_bar(stat="identity")
g<-g+xlab("Event Type")+ylab("Property+Crop Damage, in Thousands")+ggtitle("Economic Impact by Event Type, Top 20")
g<-g+theme(axis.text.x = element_text(angle=40,hjust=1))
g