We explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events that arised in the United States from 1950 to 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The analysis of these events can potentially help communities and municipalities to prevent public health and economic problems caused by severe weather events. In this report, we first load the U.S. NOAA storm database and perform some cleaning. We then analyze the data in order to answer two questions, namely: i) which types of weather events are most harmful with respect to population health across the U.S. ?; and ii) which types of weather events have the greatest economic consequences across the U.S. ?
This analysis is conducted on the U.S. NOAA storm database. The database can be downloaded here. The documentation of the database is made available by the National Weather Service at this address and the FAQs are listed in this document.
We first read the data.
NOAA <- read.csv("repdata_data_StormData.csv.bz2", sep = ",", header = TRUE)
According to section 2.1 in the documentation, only 48 types of storm data events are permitted. These events are listed in Table 1 of the documentation. However, the EVTYPE variable has 985 levels. Mislabelling and duplications in the events type are responsible for most of the discrepancy. We start with doing some cleaning.
levels(NOAA$EVTYPE)[grep("AVALAN",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "AVALANCHE"
levels(NOAA$EVTYPE)[grep("BLIZZARD",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "BLIZZARD"
levels(NOAA$EVTYPE)[grep("COASTAL.*FLOOD",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "COASTAL FLOOD"
levels(NOAA$EVTYPE)[grep("FLASH FLOOD*|FLOOD*.FLASH",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "FLASH FLOOD"
levels(NOAA$EVTYPE)[grep("LAKE.* FLOOD",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "LAKESHORE FLOOD"
levels(NOAA$EVTYPE)[grep("^(?=.*FLOOD|.*FLD)(?!.*COASTAL)(?!.*LAKESHORE)(?!.*FLASH)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "FLOOD"
levels(NOAA$EVTYPE)[grep("FUNNEL",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "FUNNEL CLOUD"
levels(NOAA$EVTYPE)[grep("(RECORD|SEVERE|BITTER|EXTREME|EXCESSIVE).(COLD|WIND CHILL)",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "EXTREME COLD/WIND CHILL"
levels(NOAA$EVTYPE)[grep("^(?=.*COLD|WIND CHILL)(?!.*EXTREME)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "COLD/WIND CHILL"
levels(NOAA$EVTYPE)[grep("DEBRIS",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "DEBRIS FLOW"
levels(NOAA$EVTYPE)[grep("(FREEZING|ICE).FOG",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "FREEZING FOG"
levels(NOAA$EVTYPE)[grep("^(?=.*FOG)(?!.*FREEZING)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "DENSE FOG"
levels(NOAA$EVTYPE)[grep("SMOKE",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "DENSE SMOKE"
levels(NOAA$EVTYPE)[grep("DROUGHT|(ABNORMAL.*|EXCESSIVE.*) DRY",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "DROUGHT"
levels(NOAA$EVTYPE)[grep("DUST DEV",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "DUST DEVIL"
levels(NOAA$EVTYPE)[grep("DUST STORM|DUSTSTORM|BLOWING DUST",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "DUST STORM"
levels(NOAA$EVTYPE)[grep("(RECORD|EXTREME|EXCESSIVE|ABNORMAL).(HEAT|.*TEMP.*|WARM.*)",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "EXCESSIVE HEAT"
levels(NOAA$EVTYPE)[grep("^(?=.*HEAT)(?!.*EXCESSIVE)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "HEAT"
levels(NOAA$EVTYPE)[grep("FROST|FREEZE",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "FROST/FREEZE"
levels(NOAA$EVTYPE)[grep("MARINE HAIL",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "MARINE HAIL"
levels(NOAA$EVTYPE)[grep("^(?=.*HAIL)(?!.*MARINE)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "HAIL"
levels(NOAA$EVTYPE)[grep("HEAVY RAIN",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "HEAVY RAIN"
levels(NOAA$EVTYPE)[grep("HEAVY SNOW",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "HEAVY SNOW"
levels(NOAA$EVTYPE)[grep("(HIGH|HEAVY) SURF",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "HIGH SURF"
levels(NOAA$EVTYPE)[grep("MARINE HIGH WIND",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "MARINE HIGH WIND"
levels(NOAA$EVTYPE)[grep("^(?=.*HIGH.* WIND)(?!.*MARINE)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "HIGH WIND"
levels(NOAA$EVTYPE)[grep("HURRICANE|TYPHOON",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "HURRICANE (TYPHOON)"
levels(NOAA$EVTYPE)[grep("ICE",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "ICE STORM"
levels(NOAA$EVTYPE)[grep("LAKE.*SNOW",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "LAKE-EFFECT SNOW"
levels(NOAA$EVTYPE)[grep("LIGHTNING|LIGNTNING",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "LIGHTNING"
levels(NOAA$EVTYPE)[grep("MARINE STRONG WIND|MARINE MISHAP",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "MARINE STRONG WIND"
levels(NOAA$EVTYPE)[grep("^(?=.*STRONG WIND)(?!.*MARINE)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "HIGH WIND"
levels(NOAA$EVTYPE)[grep("MARINE (THUNDERSTORM|TSTM) WIND",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "MARINE THUNDERSTORM WIND"
levels(NOAA$EVTYPE)[grep("^(?=.*THUN.*|.*TSTM.*|.*TUNDER.*|.*THUDERSTORM.*)(?!.*MARINE)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "THUNDERSTORM WIND"
levels(NOAA$EVTYPE)[grep("RIP CURRENT",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "RIP CURRENT"
levels(NOAA$EVTYPE)[grep("SEICHE",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "SEICHE"
levels(NOAA$EVTYPE)[grep("SLEET|FREEZING (DRIZZLE|RAIN|SPRAY)",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "SLEET"
levels(NOAA$EVTYPE)[grep("^(?=.*(STORM SURGE|TIDE))(?!.*ASTRONOMICAL)",
levels(NOAA$EVTYPE),
ignore.case = TRUE,
perl = TRUE)] <- "STORM SURGE/TIDE"
levels(NOAA$EVTYPE)[grep("TORNADO|TORNDAO",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "TORNADO"
levels(NOAA$EVTYPE)[grep("TROPICAL DEPRESSION",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "TROPICAL DEPRESSION"
levels(NOAA$EVTYPE)[grep("TROPICAL STORM",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "TROPICAL STORM"
levels(NOAA$EVTYPE)[grep("TSUNAMI",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "TSUNAMI"
levels(NOAA$EVTYPE)[grep("VOLCANIC ASH",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "VOLCANIC ASH"
levels(NOAA$EVTYPE)[grep("WATER.*SPOUT|WAYTERSPOUT",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "WATERSPOUT"
levels(NOAA$EVTYPE)[grep("FIRE",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "WILDFIRE"
levels(NOAA$EVTYPE)[grep("WINTER STORM",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "WINTER STORM"
levels(NOAA$EVTYPE)[grep("WINTER WEATHER|WINTERY|WINTRY",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "WINTER WEATHER"
There also some EVTYPE that are sub-categories of one main event. Therefore, they are categorized under that main event.
levels(NOAA$EVTYPE)[levels(NOAA$EVTYPE) == "ASTRONOMICAL HIGH TIDE"] <- "STORM SURGE/TIDE"
levels(NOAA$EVTYPE)[levels(NOAA$EVTYPE) == "VOLCANIC ERUPTION"] <- "VOLCANIC ASH"
levels(NOAA$EVTYPE)[grep("DOWNBURST|MICROBURST|MIRCOBURST|GUSTNADO",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "THUNDERSTORM WIND"
levels(NOAA$EVTYPE)[grep("BEACH EROSI",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "COASTAL FLOOD"
levels(NOAA$EVTYPE)[grep("MUDSLIDE|LANDSLIDE",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "DEBRIS FLOW"
levels(NOAA$EVTYPE)[grep("BLOWING SNOW",
levels(NOAA$EVTYPE),
ignore.case = TRUE)] <- "BLIZZARD"
Finally, according to the documentation, estimates related to property and crop damages are rounded to three significant digits followed by an alphabetical character signifying the magnitude of the number (B/b are for billion, M/m express million and K/k indicate kilo). We need to clean these entries.
NOAA[NOAA$PROPDMGEXP == "K", ]$PROPDMG <- NOAA[NOAA$PROPDMGEXP == "K", ]$PROPDMG * 1000
NOAA[NOAA$PROPDMGEXP == "M", ]$PROPDMG <- NOAA[NOAA$PROPDMGEXP == "M", ]$PROPDMG * 1e+06
NOAA[NOAA$PROPDMGEXP == "m", ]$PROPDMG <- NOAA[NOAA$PROPDMGEXP == "m", ]$PROPDMG * 1e+06
NOAA[NOAA$PROPDMGEXP == "B", ]$PROPDMG <- NOAA[NOAA$PROPDMGEXP == "B", ]$PROPDMG * 1e+09
NOAA[NOAA$CROPDMGEXP == "K", ]$CROPDMG <- NOAA[NOAA$CROPDMGEXP == "K", ]$CROPDMG * 1000
NOAA[NOAA$CROPDMGEXP == "k", ]$CROPDMG <- NOAA[NOAA$CROPDMGEXP == "k", ]$CROPDMG * 1000
NOAA[NOAA$CROPDMGEXP == "M", ]$CROPDMG <- NOAA[NOAA$CROPDMGEXP == "M", ]$CROPDMG * 1e+06
NOAA[NOAA$CROPDMGEXP == "m", ]$CROPDMG <- NOAA[NOAA$CROPDMGEXP == "m", ]$CROPDMG * 1e+06
NOAA[NOAA$CROPDMGEXP == "B", ]$CROPDMG <- NOAA[NOAA$CROPDMGEXP == "B", ]$CROPDMG * 1e+09
Now that we have cleaned up the data, we can answer our first question: which types of weather events are most harmful with respect to population health across the U.S. ? To show how harmful is a specific cluster of events on the population, we calculate the total number of fatalities per event as well as the total number of injuries per event. This is the perfect job for the aggregate function.
NOAA.fatality <- aggregate(FATALITIES ~ EVTYPE, data = NOAA, sum)
NOAA.fatality <- NOAA.fatality[NOAA.fatality$FATALITIES > 0,]
NOAA.fatality <- NOAA.fatality[order(NOAA.fatality$FATALITIES, decreasing = TRUE),]
NOAA.injury <- aggregate(INJURIES ~ EVTYPE, data = NOAA, sum)
NOOA.injury <- NOAA.injury[NOAA.injury$INJURIES > 0,]
NOAA.injury <- NOAA.injury[order(NOAA.injury$INJURIES, decreasing = TRUE),]
par(mfrow = c(2, 1))
barplot(NOAA.fatality[1:10, 2], col = rainbow(10), legend.text = NOAA.fatality[1:10,1],
ylab = "Fatalities", main = "Total Fatalities per Event (19050 - 2011)")
barplot(NOAA.injury[1:10, 2], col = rainbow(10), legend.text = NOAA.injury[1:10,1],
ylab = "Injuries", main = "Total Injuries per Event (1950 - 2011)")
One can sees that tornadoes are by far responsible for both most fatalities and injuries
We can now focus on the second question: which types of weather events have the greatest economic consequences across the U.S. ? We will consider property and crop damages here.
NOAA.property <- aggregate(PROPDMG ~ EVTYPE, data = NOAA, sum)
NOAA.property <- NOAA.property[NOAA.property$PROPDMG > 0,]
NOAA.property <- NOAA.property[order(NOAA.property$PROPDMG, decreasing = TRUE),]
NOAA.crop <- aggregate(CROPDMG ~ EVTYPE, data = NOAA, sum)
NOOA.crop <- NOAA.crop[NOAA.crop$CROPDMG > 0,]
NOAA.crop <- NOAA.crop[order(NOAA.crop$CROPDMG, decreasing = TRUE),]
par(mfrow = c(2, 1))
barplot(NOAA.property[1:10, 2], col = rainbow(10), legend.text = NOAA.property[1:10,1],
ylab = "Property damage (dollars)", main = "Total Property Damage per Event (1950 - 2011)")
barplot(NOAA.crop[1:10, 2], col = rainbow(10), legend.text = NOAA.crop[1:10,1],
ylab = "Crop Damage (dollars)", main = "Total Crop Damage per Event (1950 - 2011)")
We can see that Flood and Drought are responsible for most of the property and crop damage, respectively.