Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011.
The goal of this data analysis is to explore the NOAA Storm Database in order to answer two questions about severe weather events:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
As shown in the results section, tornados are most harmful with respect to both population health and economy.
The NOAA Storm Database used to perform the analysis, can be downloaded here.
There is also some documentation available, where it’s explained how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
#download the file if it's not present in the data/ directory
fileName <- "data/StormData.csv"
if(!file.exists(fileName)) {
dir.create("data", showWarnings = FALSE)
download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2",
"data/StormData.csv.bz2", method="curl")
#R.utils is needed to decompress files in bzip format
library(R.utils)
bunzip2("data/StormData.csv.bz2")
}
#cache this block to avoid reloading the dataset every time
stormData <- read.csv(fileName)
Get an overview of what the dataset contains.
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From the documentation, we know that to answer the questions object of this study only the following columns are needed:
So, first we extract these columns from the initial storm dataset.
stormData <- stormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP")]
1. Most harmful type of event with respect to population health
In order to answer the first question, we need to summarize the total number of human casualties over the different event types.
#aggregate data
totalCasualties <- aggregate(FATALITIES + INJURIES ~ EVTYPE, data = stormData, FUN = sum)
colnames(totalCasualties) <- c("EVTYPE", "CASUALTIES")
#order from higher to lower
totalCasualties <- totalCasualties[order(totalCasualties$CASUALTIES, decreasing = T), ]
#let's see the top 25 events
head(totalCasualties, 25)
## EVTYPE CASUALTIES
## 834 TORNADO 96979
## 130 EXCESSIVE HEAT 8428
## 856 TSTM WIND 7461
## 170 FLOOD 7259
## 464 LIGHTNING 6046
## 275 HEAT 3037
## 153 FLASH FLOOD 2755
## 427 ICE STORM 2064
## 760 THUNDERSTORM WIND 1621
## 972 WINTER STORM 1527
## 359 HIGH WIND 1385
## 244 HAIL 1376
## 411 HURRICANE/TYPHOON 1339
## 310 HEAVY SNOW 1148
## 957 WILDFIRE 986
## 786 THUNDERSTORM WINDS 972
## 30 BLIZZARD 906
## 188 FOG 796
## 585 RIP CURRENT 600
## 955 WILD/FOREST FIRE 557
## 586 RIP CURRENTS 501
## 278 HEAT WAVE 481
## 117 DUST STORM 462
## 978 WINTER WEATHER 431
## 848 TROPICAL STORM 398
Tornados appear to be the most harmful weather events for humans. Next in the list are excessive heat, wind, floods and lightnings. Nevertheless, some event types seem to refer to the same weather event, but names in database are slightly different. For example, THUNDERSTORM WIND appears also as THUNDERSTORM WINDS, TSTM WIND, TSTM WINDS, and more. The same for FLOOD, in the event types we can find also FLASH FLOOD, FLASH FLOODING, FLOOD/FLASH FLOOD, etc. All these event types are related to same weather event, therefore their effect should be aggregated as only one type.
#which rows contain these event types related to floods
floods <- which(totalCasualties$EVTYPE %in% c("FLOOD", "FLASH FLOOD", "FLASH FLOODING", "FLOOD/FLASH FLOOD", "FLOODS", "FLASH FLOODS") )
#unify event type name
totalCasualties$EVTYPE[floods] <- rep("FLOOD", length(floods))
#now do the same for thunderstorm winds
tstm <- which(totalCasualties$EVTYPE %in% c("THUNDERSTORM WIND", "TSTM WIND", "THUNDERSTORM WINDS", "TSTM WINDS", "TUNDERSTORM WIND", "TSTMW", "TSTM WND", "TSTM WINDS"))
totalCasualties$EVTYPE[tstm] <- rep("THUNDERSTORM WIND", length(tstm))
#same for heat
heat <- which(totalCasualties$EVTYPE %in% c("EXCESSIVE HEAT", "HEAT") )
totalCasualties$EVTYPE[heat] <- rep("HEAT", length(heat))
#now aggregate again over event types
totalCasualties <- aggregate(CASUALTIES ~ EVTYPE, data = totalCasualties, FUN = sum)
#reorder again
totalCasualties <- totalCasualties[order(totalCasualties$CASUALTIES, decreasing = T), ]
#let's see the top 25 events
head(totalCasualties, 25)
## EVTYPE CASUALTIES
## 827 TORNADO 96979
## 269 HEAT 11465
## 166 FLOOD 10075
## 754 THUNDERSTORM WIND 10054
## 458 LIGHTNING 6046
## 421 ICE STORM 2064
## 960 WINTER STORM 1527
## 353 HIGH WIND 1385
## 238 HAIL 1376
## 405 HURRICANE/TYPHOON 1339
## 304 HEAVY SNOW 1148
## 945 WILDFIRE 986
## 30 BLIZZARD 906
## 182 FOG 796
## 579 RIP CURRENT 600
## 943 WILD/FOREST FIRE 557
## 580 RIP CURRENTS 501
## 272 HEAT WAVE 481
## 117 DUST STORM 462
## 966 WINTER WEATHER 431
## 841 TROPICAL STORM 398
## 19 AVALANCHE 394
## 139 EXTREME COLD 391
## 670 STRONG WIND 383
## 89 DENSE FOG 360
Tornado is still clearly the event type causing greatest number of casualties and below in the list, the distribution has not changed that much, but the number of casualties due to heat, floods and thunderstorm winds is now higher.
library(ggplot2)
topCas <- totalCasualties[1:10,]
#reorder levels of factor to show events in decreasing order of harmfulness
topCas$EVTYPE <- factor(topCas$EVTYPE, levels = topCas$EVTYPE[order(topCas$CASUALTIES, decreasing = T)])
ggplot(topCas, aes(EVTYPE, CASUALTIES)) +
geom_bar(stat = "identity", colour="black", fill="#DD8888", width=0.7) +
ylab("Casualties") + xlab("Event Type") +
ggtitle("Human Casualties due to severe weather events in the U.S. (1950-2011)") +
theme(axis.text=element_text(size=7), title=element_text(size=10))
2. Types of events with the greatest economic consequences
To address the second question, it’s also necessary to aggregate data over event types. But the data provided in CROPDMG and PROPDMG columns needs first to be converted, so all the values are expressed in the same units. To properly convert the amounts, PROPDMGEXP and CROPDMGEXP columns have to be used. From the documentation, we now that “h” or “H” mean 10^2, “k” or “K” mean 10^3, “m” or “M” mean 10^6 and “b” or “B” mean 10^9.
Inspecting the data stored in PROPDMGEXP and CROPDMGEXP, also other characters like “-”, “+” or “?” and the numbers like “1”, “2”, etc. appear. I couldn’t find information about their meaning in the documentation, though.
summary(stormData$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
summary(stormData$CROPDMGEXP)
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Since these characters are not frequently occurring compared to the letters, I decide to ignore them.
Next, let’s unify the units to millions of dollars, which will be easier to handle than amounts in dollars
#unify units for property damage
#if there is no valid character specified, assume that the amount is in dollars with no scale
stormData[!stormData$PROPDMGEXP %in% c("h", "k", "b"), 4] <- stormData[!stormData$PROPDMGEXP %in% c("h", "k", "b"), 4] * 10^-6
stormData[stormData$PROPDMGEXP == "h", 4] <- stormData[stormData$PROPDMGEXP == "h", 4] * 10^-4
stormData[stormData$PROPDMGEXP == "k", 4] <- stormData[stormData$PROPDMGEXP == "k", 4] * 10^-3
stormData[stormData$PROPDMGEXP == "b", 4] <- stormData[stormData$PROPDMGEXP == "b", 4] * 10^3
#now for crop damage
stormData[!stormData$CROPDMGEXP %in% c("h", "k", "b"), 6] <- stormData[!stormData$CROPDMGEXP %in% c("h", "k", "b"), 6] * 10^-6
stormData[stormData$CROPDMGEXP == "h", 6] <- stormData[stormData$CROPDMGEXP == "h", 6] * 10^-4
stormData[stormData$CROPDMGEXP == "k", 6] <- stormData[stormData$CROPDMGEXP == "k", 6] * 10^-3
stormData[stormData$CROPDMGEXP == "b", 6] <- stormData[stormData$CROPDMGEXP == "b", 6] * 10^3
Once the amounts are in the same units, both property and crop damages can be aggregated to find out which are the types of events with greatest economic consequences.
totalDmg <- aggregate(PROPDMG + CROPDMG ~ EVTYPE, data = stormData, FUN = sum)
colnames(totalDmg) <- c("EVTYPE", "AMOUNT")
#order from higher to lower
totalDmg <- totalDmg[order(totalDmg$AMOUNT, decreasing = T), ]
#show top 25 most harmful event types
head(totalDmg, 25)
## EVTYPE AMOUNT
## 834 TORNADO 3.31228
## 244 HAIL 1.68487
## 153 FLASH FLOOD 1.59933
## 856 TSTM WIND 1.44517
## 170 FLOOD 1.06798
## 760 THUNDERSTORM WIND 0.94364
## 464 LIGHTNING 0.60693
## 786 THUNDERSTORM WINDS 0.46897
## 359 HIGH WIND 0.34201
## 972 WINTER STORM 0.13470
## 310 HEAVY SNOW 0.12442
## 957 WILDFIRE 0.08882
## 427 ICE STORM 0.06769
## 676 STRONG WIND 0.06461
## 290 HEAVY RAIN 0.06196
## 376 HIGH WINDS 0.05738
## 848 TROPICAL STORM 0.05432
## 164 FLASH FLOODING 0.04361
## 955 WILD/FOREST FIRE 0.04353
## 95 DROUGHT 0.03800
## 919 URBAN/SML STREAM FLD 0.02885
## 30 BLIZZARD 0.02549
## 402 HURRICANE 0.02085
## 177 FLOOD/FLASH FLOOD 0.02058
## 670 STORM SURGE 0.01940
Tornados are also the most harmful weather event from the economic point of view. Next in the list appear floods and wind. The problem with the names of the event types is also happening in that case. So. let’s try to unify names of the most harmful events.
#which rows contain these event types related to floods
floods <- which(totalDmg$EVTYPE %in% c("FLOOD", "FLASH FLOOD", "FLASH FLOODING", "FLOOD/FLASH FLOOD", "FLOODS", "FLASH FLOODS") )
#unify event type name
totalDmg$EVTYPE[floods] <- rep("FLOOD", length(floods))
#now do the same for thunderstorm winds
tstm <- which(totalDmg$EVTYPE %in% c("THUNDERSTORM WIND", "TSTM WIND", "THUNDERSTORM WINDS", "TSTM WINDS", "TUNDERSTORM WIND", "TSTMW", "TSTM WND", "TSTM WINDS"))
totalDmg$EVTYPE[tstm] <- rep("THUNDERSTORM WIND", length(tstm))
#now aggregate again over event types
totalDmg <- aggregate(AMOUNT ~ EVTYPE, data = totalDmg, FUN = sum)
#reorder again
totalDmg <- totalDmg[order(totalDmg$AMOUNT, decreasing = T), ]
#let's see the top 25 events
head(totalDmg, 25)
## EVTYPE AMOUNT
## 828 TORNADO 3.31228
## 755 THUNDERSTORM WIND 2.85796
## 167 FLOOD 2.73419
## 239 HAIL 1.68487
## 459 LIGHTNING 0.60693
## 354 HIGH WIND 0.34201
## 961 WINTER STORM 0.13470
## 305 HEAVY SNOW 0.12442
## 946 WILDFIRE 0.08882
## 422 ICE STORM 0.06769
## 671 STRONG WIND 0.06461
## 285 HEAVY RAIN 0.06196
## 371 HIGH WINDS 0.05738
## 842 TROPICAL STORM 0.05432
## 944 WILD/FOREST FIRE 0.04353
## 95 DROUGHT 0.03800
## 908 URBAN/SML STREAM FLD 0.02885
## 30 BLIZZARD 0.02549
## 397 HURRICANE 0.02085
## 665 STORM SURGE 0.01940
## 437 LANDSLIDE 0.01900
## 585 RIVER FLOOD 0.01735
## 894 URBAN FLOOD 0.01422
## 435 LAKE-EFFECT SNOW 0.01414
## 140 EXTREME COLD 0.01378
Tornados are still in first place, but the difference with the second and the third event types, i.e. thunderstorm wind and floods is not so high as in the case of human casualties.
topDmg <- totalDmg[1:10,]
#reorder levels of factor to show events in decreasing order of harmfulness
topDmg$EVTYPE <- factor(topDmg$EVTYPE, levels = topDmg$EVTYPE[order(topDmg$AMOUNT, decreasing = T)])
ggplot(topDmg, aes(EVTYPE, AMOUNT)) +
geom_bar(stat = "identity", colour="black", fill="#DD8888", width=0.7) +
ylab("Economic Damages (million dollars)") + xlab("Event Type") +
ggtitle("Economic Damage Estimate caused by severe weather events in the U.S. (1950-2011)") +
theme(axis.text=element_text(size=6), title=element_text(size=10))