Synopsis

This analysis examines the damage caused by storm events in the U.S. recorded by the National Oceanic and Atmospheric Administration between 1950 and 2011. The primary focuses of this analysis are damage to health and life, measured by injuries and fatalities, and damage to property and crops, measured in U.S. dollars. Because events types were recorded in a non-uniform way by NOAA staff over the years, I take steps to clean the event type variable and make it ready for use in the analysis. I also scale the property damage and crop damage variables to make them ready for comparison.

I find that the greatest damage to people, as measured by aggregate fatalities and injuries, comes from tornadoes, though significant damage is also caused by floods. By contrast, floods are the main source of property damage and crop damage.

Data processing

Before starting, we load the necessary packages.

library(dplyr)
library(ggplot2)
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.0.4
## Warning: package 'R.oo' was built under R version 4.0.3
## Warning: package 'R.methodsS3' was built under R version 4.0.3

To begin, I download the data file to my computer from the NOAA website. I then use bunzip2 to convert the file to csv.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "C:/Users/19732/Downloads/repdata_data_StormData.csv.bz2"
download.file(url, destfile)
bunzip2(filename="repdata_data_StormData.csv.bz2", destname="repdata_data_StormData.csv")

I then load the file into R.

setwd("C:/Users/19732/Downloads")
stormData <- read.csv("repdata_data_StormData.csv")

Next I use str to understand the structure of the data.

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

First we notice that many of these variables have empty values, such as BGN_AZI. Similarly, COUNTYENDN begins with several NA values.

We are primarily interested here in how event type is related to varying levels of damage to population health and economic consequences. The variables that will aid us in this analysis are EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. FATALITIES and INJURIES are integer variables, and EVTYPE is a character variable.

nrow(table(stormData$EVTYPE))
## [1] 985

There are 985 different values that the variable EVTYPE takes. For this analysis, I ignored any values with less than 200 values. This leaves the following event types:

table1 <- data.frame(table(stormData$EVTYPE))
table1 <- table1 %>% filter(Freq >= 200)
table1
##                        Var1   Freq
## 1                 AVALANCHE    386
## 2                  BLIZZARD   2719
## 3             COASTAL FLOOD    650
## 4           COLD/WIND CHILL    539
## 5                 DENSE FOG   1293
## 6                   DROUGHT   2488
## 7                DUST STORM    427
## 8            EXCESSIVE HEAT   1678
## 9              EXTREME COLD    655
## 10  EXTREME COLD/WIND CHILL   1002
## 11        EXTREME WINDCHILL    204
## 12              FLASH FLOOD  54277
## 13           FLASH FLOODING    682
## 14                    FLOOD  25326
## 15        FLOOD/FLASH FLOOD    624
## 16                      FOG    538
## 17            FREEZING RAIN    250
## 18             FROST/FREEZE   1342
## 19             FUNNEL CLOUD   6839
## 20                     HAIL 288661
## 21                     HEAT    767
## 22               HEAVY RAIN  11723
## 23               HEAVY SNOW  15708
## 24     HEAVY SURF/HIGH SURF    228
## 25                HIGH SURF    725
## 26                HIGH WIND  20212
## 27               HIGH WINDS   1533
## 28                ICE STORM   2006
## 29         LAKE-EFFECT SNOW    636
## 30                LANDSLIDE    600
## 31                LIGHTNING  15754
## 32              MARINE HAIL    442
## 33 MARINE THUNDERSTORM WIND   5812
## 34         MARINE TSTM WIND   6175
## 35              RIP CURRENT    470
## 36             RIP CURRENTS    304
## 37                     SNOW    587
## 38              STORM SURGE    261
## 39              STRONG WIND   3566
## 40        THUNDERSTORM WIND  82563
## 41       THUNDERSTORM WINDS  20843
## 42                  TORNADO  60652
## 43           TROPICAL STORM    690
## 44                TSTM WIND 219940
## 45           TSTM WIND/HAIL   1028
## 46              URBAN FLOOD    249
## 47     URBAN/SML STREAM FLD   3392
## 48               WATERSPOUT   3796
## 49         WILD/FOREST FIRE   1457
## 50                 WILDFIRE   2761
## 51                     WIND    340
## 52             WINTER STORM  11433
## 53           WINTER WEATHER   7026
## 54       WINTER WEATHER/MIX   1104

Let’s now note in the dataset which rows fall into our top event types listed above.

stormData$topEVTYPE <- stormData$EVTYPE %in% table1$Var1

We can see here that the subset of EVTYPE values included captures 14,284 of a total of 15,145 fatalities:

aggregate(FATALITIES~ topEVTYPE, data=stormData, sum)
##   topEVTYPE FATALITIES
## 1     FALSE        861
## 2      TRUE      14284

I also included any value containing “HURRICANE”, as hurricanes may contribute a large share of economic and human damage despite being rare. Here I code all hurricanes in the group “HURRICANE”.

stormData$EVTYPE <- ifelse(grepl("HURRICANE", stormData$EVTYPE, ignore.case=TRUE), "HURRICANE", stormData$EVTYPE)

At this step, I combine EVTYPE values that are included in our list above but refer to the same type of weather event:

stormData$EVTYPE <- ifelse(stormData$EVTYPE=="EXCESSIVE HEAT","HEAT", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="FLOOD/FLASH FLOOD","FLOOD", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="FLASH FLOODING","FLASH FLOOD", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="HEAVY SURF/HIGH SURF","HIGH SURF", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="HIGH WINDS","HIGH WIND", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="MARINE TSTM WIND","MARINE THUNDERSTORM WIND", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="RIP CURRENT","RIP CURRENTS", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="THUNDERSTORM WIND","THUNDERSTORM WINDS", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="TSTM WIND/HAIL","TSTM WIND", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="URBAN/SML STREAM FLD","URBAN FLOOD", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="WINTER WEATHER/MIX","WINTER WEATHER", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="WILD/FOREST FIRE","WILDFIRE", stormData$EVTYPE)

Next, let’s examine PROPDMG, CROPDMG, PROPDMGEXP, and CROPDMGEXP. Here we can see the values that PROPDMGEXP and CROPDMGEXP can take.

table(stormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(stormData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

The vast majority of rows are either blank or a capital letter representing an exponent of ten: H, K, M, and B. Therefore, I ignore rows with any other values. This variable converts the originals into numeric scalars.

stormData$PROPDMGEXP2 <- ifelse(stormData$PROPDMGEXP=="H", 100, 
                          ifelse(stormData$PROPDMGEXP=="K", 1000,
                          ifelse(stormData$PROPDMGEXP=="M", 1000000,
                          ifelse(stormData$PROPDMGEXP=="B", 1000000000, NA))))
stormData$CROPDMGEXP2 <- ifelse(stormData$CROPDMGEXP=="H", 100, 
                          ifelse(stormData$CROPDMGEXP=="K", 1000,
                          ifelse(stormData$CROPDMGEXP=="M", 1000000,
                          ifelse(stormData$CROPDMGEXP=="B", 1000000000, NA))))

We can now create two new variables that scale PROPDMG and CROPDMG uniformly.

stormData$PROPDMG2 <- stormData$PROPDMG * stormData$PROPDMGEXP2
stormData$CROPDMG2 <- stormData$CROPDMG * stormData$CROPDMGEXP2

Let’s now filter out the rows that we don’t want included in our analysis. First, we only want rows with EVTYPES that are in our topEVTYPE list.

stormData2 <- filter(stormData, topEVTYPE==TRUE)

Next, we will filter out the rows that don’t have full data for FATALITIES, INJURIES, CROPDMG2, and PROPDMG2.

stormData2$fullData <- ifelse((is.na(stormData2$FATALITIES)==FALSE & 
                               is.na(stormData2$INJURIES)==FALSE &
                               is.na(stormData2$PROPDMG2)==FALSE & 
                               is.na(stormData2$CROPDMG2)==FALSE &
                               is.numeric(stormData2$FATALITIES)==TRUE &
                               is.numeric(stormData2$INJURIES)==TRUE &
                               is.numeric(stormData2$PROPDMG2)==TRUE & 
                               is.numeric(stormData2$CROPDMG2)==TRUE
                                   ), 1, 0)
stormData3 <- filter(stormData2, fullData==1)
nrow(stormData3)
## [1] 278492

This leaves about 278,000 storm events in our data set. With our data now prepared, we can proceed to analyze damage from these storm events.

Results

Damage to people

Our first question is, “Across the United States, which types of events are most harmful with respect to population health?” First we will aggregate fatalities and injuries by event type.

fatalitiesByEventType <- aggregate(FATALITIES ~ EVTYPE, data=stormData3, sum)
injuriesByEventType <- aggregate(INJURIES ~ EVTYPE, data=stormData3, sum)

Now we will print out these tables in descending order.

fatalitiesByEventType[order(fatalitiesByEventType$FATALITIES, decreasing = TRUE), ]
##                      EVTYPE FATALITIES
## 32                  TORNADO       1064
## 10              FLASH FLOOD        390
## 16                     HEAT        375
## 11                    FLOOD        264
## 27             RIP CURRENTS        211
## 24                LIGHTNING        173
## 31       THUNDERSTORM WINDS        141
## 4           COLD/WIND CHILL         94
## 1                 AVALANCHE         83
## 19                HIGH SURF         70
## 20                HIGH WIND         63
## 37                 WILDFIRE         60
## 9   EXTREME COLD/WIND CHILL         58
## 30              STRONG WIND         55
## 39           WINTER WEATHER         30
## 2                  BLIZZARD         23
## 38             WINTER STORM         20
## 33           TROPICAL STORM         19
## 17               HEAVY RAIN         16
## 21                ICE STORM         13
## 18               HEAVY SNOW         12
## 34                TSTM WIND         11
## 26 MARINE THUNDERSTORM WIND         10
## 7                DUST STORM          5
## 23                LANDSLIDE          5
## 3             COASTAL FLOOD          3
## 8              EXTREME COLD          3
## 15                     HAIL          2
## 5                 DENSE FOG          0
## 6                   DROUGHT          0
## 12                      FOG          0
## 13             FROST/FREEZE          0
## 14             FUNNEL CLOUD          0
## 22         LAKE-EFFECT SNOW          0
## 25              MARINE HAIL          0
## 28                     SNOW          0
## 29              STORM SURGE          0
## 35              URBAN FLOOD          0
## 36               WATERSPOUT          0
injuriesByEventType[order(injuriesByEventType$INJURIES, decreasing = TRUE), ]
##                      EVTYPE INJURIES
## 32                  TORNADO    11960
## 11                    FLOOD     6495
## 16                     HEAT     2071
## 21                ICE STORM     1616
## 31       THUNDERSTORM WINDS     1491
## 24                LIGHTNING     1010
## 10              FLASH FLOOD      669
## 37                 WILDFIRE      582
## 2                  BLIZZARD      407
## 20                HIGH WIND      398
## 34                TSTM WIND      375
## 39           WINTER WEATHER      324
## 33           TROPICAL STORM      311
## 15                     HAIL      303
## 27             RIP CURRENTS      140
## 30              STRONG WIND      135
## 19                HIGH SURF      125
## 38             WINTER STORM      114
## 18               HEAVY SNOW       92
## 7                DUST STORM       69
## 1                 AVALANCHE       58
## 17               HEAVY RAIN       47
## 26 MARINE THUNDERSTORM WIND       26
## 4           COLD/WIND CHILL       12
## 23                LANDSLIDE       12
## 9   EXTREME COLD/WIND CHILL        7
## 12                      FOG        5
## 6                   DROUGHT        4
## 5                 DENSE FOG        3
## 3             COASTAL FLOOD        1
## 8              EXTREME COLD        0
## 13             FROST/FREEZE        0
## 14             FUNNEL CLOUD        0
## 22         LAKE-EFFECT SNOW        0
## 25              MARINE HAIL        0
## 28                     SNOW        0
## 29              STORM SURGE        0
## 35              URBAN FLOOD        0
## 36               WATERSPOUT        0

Let’s now see how these groups compare visually using fatalities and injuries. First we merge our fatalities dataset and our injuries dataset.

fatalitiesByEventType$MEASURE <- rep("Fatalities", 39)
injuriesByEventType$MEASURE <- rep("Injuries", 39)
fatalitiesByEventType <- rename(fatalitiesByEventType, TOTAL=FATALITIES)
injuriesByEventType <- rename(injuriesByEventType, TOTAL=INJURIES)
fataliesAndInjuriesByEventType <- rbind(fatalitiesByEventType, injuriesByEventType)

Now we can plot the results.

ggplot(data=fataliesAndInjuriesByEventType, aes(x=EVTYPE, y=TOTAL, fill=MEASURE))+geom_bar(stat="identity")+labs(title="Fatalities and injuries by event type")+scale_fill_brewer(palette="Dark2")+theme(plot.margin = margin(t=0.5, r=0.5, b=0.5, l=0.5, "cm"))+coord_flip()+labs(x="Event type", y="Total")

As we saw in our printed lists, the most deadly and injuring events are tornadoes, though floods also contribute many deaths and injuries. Heat can also be quite deadly.

Economic damage

Our second question is, “Across the United States, which types of events have the greatest economic consequences?” First we will aggregate property damage and crop damage by event type.

propDmgByEventType <- aggregate(PROPDMG2 ~ EVTYPE, data=stormData3, sum)
cropDmgByEventType <- aggregate(CROPDMG2 ~ EVTYPE, data=stormData3, sum)

Now we will print out these tables in descending order.

propDmgByEventType[order(propDmgByEventType$PROPDMG2, decreasing = TRUE), ]
##                      EVTYPE     PROPDMG2
## 11                    FLOOD 132904889050
## 32                  TORNADO  16166771690
## 15                     HAIL   7991587690
## 10              FLASH FLOOD   7388886080
## 31       THUNDERSTORM WINDS   3678929940
## 37                 WILDFIRE   3547207470
## 20                HIGH WIND   2517547340
## 33           TROPICAL STORM   1056591350
## 38             WINTER STORM   1017844200
## 21                ICE STORM    903037300
## 34                TSTM WIND    662600660
## 24                LIGHTNING    315273980
## 17               HEAVY RAIN    310701930
## 6                   DROUGHT    233721000
## 18               HEAVY SNOW    178292000
## 3             COASTAL FLOOD    167580560
## 23                LANDSLIDE    150303500
## 30              STRONG WIND    119252060
## 2                  BLIZZARD     94981000
## 19                HIGH SURF     83017500
## 22         LAKE-EFFECT SNOW     40035000
## 39           WINTER WEATHER     19897500
## 35              URBAN FLOOD     12778600
## 13             FROST/FREEZE      9480000
## 9   EXTREME COLD/WIND CHILL      7038000
## 36               WATERSPOUT      5206200
## 7                DUST STORM      3399000
## 16                     HEAT      3058200
## 29              STORM SURGE      2915000
## 5                 DENSE FOG      2842000
## 1                 AVALANCHE      2385800
## 8              EXTREME COLD      2355000
## 4           COLD/WIND CHILL      1990000
## 26 MARINE THUNDERSTORM WIND       436400
## 14             FUNNEL CLOUD        65100
## 12                      FOG        32000
## 25              MARINE HAIL         4000
## 27             RIP CURRENTS         1000
## 28                     SNOW         1000
cropDmgByEventType[order(cropDmgByEventType$CROPDMG2, decreasing = TRUE), ]
##                      EVTYPE   CROPDMG2
## 11                    FLOOD 5265989450
## 21                ICE STORM 5022110000
## 15                     HAIL 2028390900
## 6                   DROUGHT 1652696000
## 10              FLASH FLOOD 1397535100
## 13             FROST/FREEZE  931801000
## 20                HIGH WIND  651624850
## 31       THUNDERSTORM WINDS  600812250
## 34                TSTM WIND  521631450
## 16                     HEAT  493135000
## 33           TROPICAL STORM  451711000
## 32                  TORNADO  353376460
## 37                 WILDFIRE  284810100
## 18               HEAVY SNOW  131643100
## 2                  BLIZZARD  112060000
## 30              STRONG WIND   64948500
## 17               HEAVY RAIN   64585800
## 38             WINTER STORM   23724000
## 23                LANDSLIDE   20017000
## 39           WINTER WEATHER   15000000
## 24                LIGHTNING    5512150
## 35              URBAN FLOOD    3581600
## 7                DUST STORM    2350000
## 8              EXTREME COLD    2255000
## 4           COLD/WIND CHILL     600000
## 26 MARINE THUNDERSTORM WIND      50000
## 28                     SNOW      10000
## 29              STORM SURGE       5000
## 1                 AVALANCHE          0
## 3             COASTAL FLOOD          0
## 5                 DENSE FOG          0
## 9   EXTREME COLD/WIND CHILL          0
## 12                      FOG          0
## 14             FUNNEL CLOUD          0
## 19                HIGH SURF          0
## 22         LAKE-EFFECT SNOW          0
## 25              MARINE HAIL          0
## 27             RIP CURRENTS          0
## 36               WATERSPOUT          0

At the top of the lists above, we see that floods rank as the most damaging events in the aggregate. Ice storms are also highly damaging to crops, and tornadoes can be highly damaging to property.

Let’s now see how these event types compare visually. First we merge our property damage dataset and our crop damage dataset.

propDmgByEventType$MEASURE <- rep("Property Damage", 39)
cropDmgByEventType$MEASURE <- rep("Crop Damage", 39)
propDmgByEventType <- rename(propDmgByEventType, TOTAL=PROPDMG2)
cropDmgByEventType <- rename(cropDmgByEventType, TOTAL=CROPDMG2)
allEconomicDmgByEventType <- rbind(propDmgByEventType, cropDmgByEventType)

Now we can plot the results.

ggplot(data=allEconomicDmgByEventType, aes(x=EVTYPE, y=TOTAL, fill=MEASURE))+geom_bar(stat="identity")+labs(title="Economic damage by event type")+scale_fill_brewer(palette="Dark2")+theme(plot.margin = margin(t=0.5, r=0.5, b=0.5, l=0.5, "cm"))+coord_flip()+labs(x="Event type", y="Damage (in dollars)")

Matching our printed results, we see that floods by far contribute the most damage to property and crops.