Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011.

The goal of this data analysis is to explore the NOAA Storm Database in order to answer two questions about severe weather events:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

As shown in the results section, tornados are most harmful with respect to both population health and economy.

Data Processing

The NOAA Storm Database used to perform the analysis, can be downloaded here.

There is also some documentation available, where it’s explained how some of the variables are constructed/defined.

#download the file if it's not present in the data/ directory
fileName <- "data/StormData.csv"
if(!file.exists(fileName)) {
  dir.create("data", showWarnings = FALSE)
  download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2", 
              "data/StormData.csv.bz2", method="curl")
  #R.utils is needed to decompress files in bzip format
  library(R.utils)
  bunzip2("data/StormData.csv.bz2")
}
#cache this block to avoid reloading the dataset every time
stormData <- read.csv(fileName)

Get an overview of what the dataset contains.

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

From the documentation, we know that to answer the questions object of this study only the following columns are needed:

So, first we extract these columns from the initial storm dataset.

stormData <- stormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", 
                      "CROPDMG", "CROPDMGEXP")]

Results

1. Most harmful type of event with respect to population health

In order to answer the first question, we need to summarize the total number of human casualties over the different event types.

#aggregate data
totalCasualties <- aggregate(FATALITIES + INJURIES ~ EVTYPE, data = stormData, FUN = sum)
colnames(totalCasualties) <- c("EVTYPE", "CASUALTIES")
#order from higher to lower
totalCasualties <- totalCasualties[order(totalCasualties$CASUALTIES, decreasing = T), ]
#let's see the top 25 events 
head(totalCasualties, 25)
##                 EVTYPE CASUALTIES
## 834            TORNADO      96979
## 130     EXCESSIVE HEAT       8428
## 856          TSTM WIND       7461
## 170              FLOOD       7259
## 464          LIGHTNING       6046
## 275               HEAT       3037
## 153        FLASH FLOOD       2755
## 427          ICE STORM       2064
## 760  THUNDERSTORM WIND       1621
## 972       WINTER STORM       1527
## 359          HIGH WIND       1385
## 244               HAIL       1376
## 411  HURRICANE/TYPHOON       1339
## 310         HEAVY SNOW       1148
## 957           WILDFIRE        986
## 786 THUNDERSTORM WINDS        972
## 30            BLIZZARD        906
## 188                FOG        796
## 585        RIP CURRENT        600
## 955   WILD/FOREST FIRE        557
## 586       RIP CURRENTS        501
## 278          HEAT WAVE        481
## 117         DUST STORM        462
## 978     WINTER WEATHER        431
## 848     TROPICAL STORM        398

Tornados appear to be the most harmful weather events for humans. Next in the list are excessive heat, wind, floods and lightnings. Nevertheless, some event types seem to refer to the same weather event, but names in database are slightly different. For example, THUNDERSTORM WIND appears also as THUNDERSTORM WINDS, TSTM WIND, TSTM WINDS, and more. The same for FLOOD, in the event types we can find also FLASH FLOOD, FLASH FLOODING, FLOOD/FLASH FLOOD, etc. All these event types are related to same weather event, therefore their effect should be aggregated as only one type.

#which rows contain these event types related to floods
floods <- which(totalCasualties$EVTYPE %in% c("FLOOD", "FLASH FLOOD", "FLASH FLOODING", "FLOOD/FLASH FLOOD", "FLOODS", "FLASH FLOODS") )
#unify event type name 
totalCasualties$EVTYPE[floods] <- rep("FLOOD", length(floods))
#now do the same for thunderstorm winds
tstm <- which(totalCasualties$EVTYPE %in% c("THUNDERSTORM WIND", "TSTM WIND", "THUNDERSTORM WINDS", "TSTM WINDS", "TUNDERSTORM WIND", "TSTMW", "TSTM WND", "TSTM WINDS"))
totalCasualties$EVTYPE[tstm] <- rep("THUNDERSTORM WIND", length(tstm))
#same for heat
heat <- which(totalCasualties$EVTYPE %in% c("EXCESSIVE HEAT", "HEAT") )
totalCasualties$EVTYPE[heat] <- rep("HEAT", length(heat))

#now aggregate again over event types
totalCasualties <- aggregate(CASUALTIES ~ EVTYPE, data = totalCasualties, FUN = sum)
#reorder again
totalCasualties <- totalCasualties[order(totalCasualties$CASUALTIES, decreasing = T), ]
#let's see the top 25 events 
head(totalCasualties, 25)
##                EVTYPE CASUALTIES
## 827           TORNADO      96979
## 269              HEAT      11465
## 166             FLOOD      10075
## 754 THUNDERSTORM WIND      10054
## 458         LIGHTNING       6046
## 421         ICE STORM       2064
## 960      WINTER STORM       1527
## 353         HIGH WIND       1385
## 238              HAIL       1376
## 405 HURRICANE/TYPHOON       1339
## 304        HEAVY SNOW       1148
## 945          WILDFIRE        986
## 30           BLIZZARD        906
## 182               FOG        796
## 579       RIP CURRENT        600
## 943  WILD/FOREST FIRE        557
## 580      RIP CURRENTS        501
## 272         HEAT WAVE        481
## 117        DUST STORM        462
## 966    WINTER WEATHER        431
## 841    TROPICAL STORM        398
## 19          AVALANCHE        394
## 139      EXTREME COLD        391
## 670       STRONG WIND        383
## 89          DENSE FOG        360

Tornado is still clearly the event type causing greatest number of casualties and below in the list, the distribution has not changed that much, but the number of casualties due to heat, floods and thunderstorm winds is now higher.

library(ggplot2)
topCas <- totalCasualties[1:10,]
#reorder levels of factor to show events in decreasing order of harmfulness
topCas$EVTYPE <- factor(topCas$EVTYPE, levels = topCas$EVTYPE[order(topCas$CASUALTIES, decreasing = T)])
ggplot(topCas, aes(EVTYPE, CASUALTIES)) + 
  geom_bar(stat = "identity", colour="black", fill="#DD8888", width=0.7) + 
  ylab("Casualties") + xlab("Event Type") + 
  ggtitle("Human Casualties due to severe weather events in the U.S. (1950-2011)") +
    theme(axis.text=element_text(size=7), title=element_text(size=10))

plot of chunk plotCasualties

2. Types of events with the greatest economic consequences

To address the second question, it’s also necessary to aggregate data over event types. But the data provided in CROPDMG and PROPDMG columns needs first to be converted, so all the values are expressed in the same units. To properly convert the amounts, PROPDMGEXP and CROPDMGEXP columns have to be used. From the documentation, we now that “h” or “H” mean 10^2, “k” or “K” mean 10^3, “m” or “M” mean 10^6 and “b” or “B” mean 10^9.

Inspecting the data stored in PROPDMGEXP and CROPDMGEXP, also other characters like “-”, “+” or “?” and the numbers like “1”, “2”, etc. appear. I couldn’t find information about their meaning in the documentation, though.

summary(stormData$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
summary(stormData$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Since these characters are not frequently occurring compared to the letters, I decide to ignore them.

Next, let’s unify the units to millions of dollars, which will be easier to handle than amounts in dollars

#unify units for property damage
#if there is no valid character specified, assume that the amount is in dollars with no scale
stormData[!stormData$PROPDMGEXP %in% c("h", "k", "b"), 4] <- stormData[!stormData$PROPDMGEXP %in% c("h", "k", "b"), 4] * 10^-6
stormData[stormData$PROPDMGEXP == "h", 4] <- stormData[stormData$PROPDMGEXP == "h", 4] * 10^-4
stormData[stormData$PROPDMGEXP == "k", 4] <- stormData[stormData$PROPDMGEXP == "k", 4] * 10^-3
stormData[stormData$PROPDMGEXP == "b", 4] <- stormData[stormData$PROPDMGEXP == "b", 4] * 10^3
#now for crop damage
stormData[!stormData$CROPDMGEXP %in% c("h", "k", "b"), 6] <- stormData[!stormData$CROPDMGEXP %in% c("h", "k", "b"), 6] * 10^-6
stormData[stormData$CROPDMGEXP == "h", 6] <- stormData[stormData$CROPDMGEXP == "h", 6] * 10^-4
stormData[stormData$CROPDMGEXP == "k", 6] <- stormData[stormData$CROPDMGEXP == "k", 6] * 10^-3
stormData[stormData$CROPDMGEXP == "b", 6] <- stormData[stormData$CROPDMGEXP == "b", 6] * 10^3

Once the amounts are in the same units, both property and crop damages can be aggregated to find out which are the types of events with greatest economic consequences.

totalDmg <- aggregate(PROPDMG + CROPDMG ~ EVTYPE, data = stormData, FUN = sum)
colnames(totalDmg) <- c("EVTYPE", "AMOUNT")
#order from higher to lower
totalDmg <- totalDmg[order(totalDmg$AMOUNT, decreasing = T), ]
#show top 25 most harmful event types
head(totalDmg, 25)
##                   EVTYPE  AMOUNT
## 834              TORNADO 3.31228
## 244                 HAIL 1.68487
## 153          FLASH FLOOD 1.59933
## 856            TSTM WIND 1.44517
## 170                FLOOD 1.06798
## 760    THUNDERSTORM WIND 0.94364
## 464            LIGHTNING 0.60693
## 786   THUNDERSTORM WINDS 0.46897
## 359            HIGH WIND 0.34201
## 972         WINTER STORM 0.13470
## 310           HEAVY SNOW 0.12442
## 957             WILDFIRE 0.08882
## 427            ICE STORM 0.06769
## 676          STRONG WIND 0.06461
## 290           HEAVY RAIN 0.06196
## 376           HIGH WINDS 0.05738
## 848       TROPICAL STORM 0.05432
## 164       FLASH FLOODING 0.04361
## 955     WILD/FOREST FIRE 0.04353
## 95               DROUGHT 0.03800
## 919 URBAN/SML STREAM FLD 0.02885
## 30              BLIZZARD 0.02549
## 402            HURRICANE 0.02085
## 177    FLOOD/FLASH FLOOD 0.02058
## 670          STORM SURGE 0.01940

Tornados are also the most harmful weather event from the economic point of view. Next in the list appear floods and wind. The problem with the names of the event types is also happening in that case. So. let’s try to unify names of the most harmful events.

#which rows contain these event types related to floods
floods <- which(totalDmg$EVTYPE %in% c("FLOOD", "FLASH FLOOD", "FLASH FLOODING", "FLOOD/FLASH FLOOD", "FLOODS", "FLASH FLOODS") )
#unify event type name 
totalDmg$EVTYPE[floods] <- rep("FLOOD", length(floods))
#now do the same for thunderstorm winds
tstm <- which(totalDmg$EVTYPE %in% c("THUNDERSTORM WIND", "TSTM WIND", "THUNDERSTORM WINDS", "TSTM WINDS", "TUNDERSTORM WIND", "TSTMW", "TSTM WND", "TSTM WINDS"))
totalDmg$EVTYPE[tstm] <- rep("THUNDERSTORM WIND", length(tstm))

#now aggregate again over event types
totalDmg <- aggregate(AMOUNT ~ EVTYPE, data = totalDmg, FUN = sum)
#reorder again
totalDmg <- totalDmg[order(totalDmg$AMOUNT, decreasing = T), ]
#let's see the top 25 events 
head(totalDmg, 25)
##                   EVTYPE  AMOUNT
## 828              TORNADO 3.31228
## 755    THUNDERSTORM WIND 2.85796
## 167                FLOOD 2.73419
## 239                 HAIL 1.68487
## 459            LIGHTNING 0.60693
## 354            HIGH WIND 0.34201
## 961         WINTER STORM 0.13470
## 305           HEAVY SNOW 0.12442
## 946             WILDFIRE 0.08882
## 422            ICE STORM 0.06769
## 671          STRONG WIND 0.06461
## 285           HEAVY RAIN 0.06196
## 371           HIGH WINDS 0.05738
## 842       TROPICAL STORM 0.05432
## 944     WILD/FOREST FIRE 0.04353
## 95               DROUGHT 0.03800
## 908 URBAN/SML STREAM FLD 0.02885
## 30              BLIZZARD 0.02549
## 397            HURRICANE 0.02085
## 665          STORM SURGE 0.01940
## 437            LANDSLIDE 0.01900
## 585          RIVER FLOOD 0.01735
## 894          URBAN FLOOD 0.01422
## 435     LAKE-EFFECT SNOW 0.01414
## 140         EXTREME COLD 0.01378

Tornados are still in first place, but the difference with the second and the third event types, i.e. thunderstorm wind and floods is not so high as in the case of human casualties.

topDmg <- totalDmg[1:10,]
#reorder levels of factor to show events in decreasing order of harmfulness
topDmg$EVTYPE <- factor(topDmg$EVTYPE, levels = topDmg$EVTYPE[order(topDmg$AMOUNT, decreasing = T)])
ggplot(topDmg, aes(EVTYPE, AMOUNT)) + 
  geom_bar(stat = "identity", colour="black", fill="#DD8888", width=0.7) + 
  ylab("Economic Damages (million dollars)") + xlab("Event Type") + 
  ggtitle("Economic Damage Estimate caused by severe weather events in the U.S. (1950-2011)") +
  theme(axis.text=element_text(size=6), title=element_text(size=10))

plot of chunk plotDamages