Storm Data in the United States

Synopsis

Using Storm Data from the National Weather Service, this analysis seeks to answer two important questions about weather events:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

To this end, the data is aggregated (by weather event) and analyzed by four key indicators: Fatalities, Injuries, Property Damage, and Crop Damage.

Data Processing

We’ll begin by loading the required packages and the .csv file. Because of its size, the file was downloaded in advance to our R working directory and is loaded from there. No pre-processing has been done to this bzip2-compressed .csv file:

library(ggplot2)
library(lubridate)
stormdata <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
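
As a minimal sketch of why no manual unzipping is needed: read.csv() reads bzip2-compressed files directly. The toy file and its two columns below are hypothetical stand-ins for the real Storm Data file. Note also that the Factor columns in the str() output above reflect the older default; in R 4.0 and later, stringsAsFactors = TRUE must be passed explicitly to get the same result.

```r
# Write a tiny, hypothetical bzip2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                     FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses on the fly
toy <- read.csv(tmp, stringsAsFactors = TRUE)
str(toy)

# Clean up the temporary file
unlink(tmp)
```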

Results 1: Events Most Harmful to Population Health

We will start by aggregating the data with respect to the EVTYPE (Weather Event Type) category. There are two indicators in this data set that tell us the impact on population health: FATALITIES and INJURIES. Let’s look at both.

fatal <- aggregate(FATALITIES ~ EVTYPE, data=stormdata, FUN=sum)
dim(fatal)
## [1] 985   2
injure <- aggregate(INJURIES ~ EVTYPE, data=stormdata, FUN=sum)
dim(injure)
## [1] 985   2
healthimpact <- cbind(fatal, injure$INJURIES)
colnames(healthimpact) <- c("Event", "Fatalities", "Injuries")
head(healthimpact)
##                   Event Fatalities Injuries
## 1    HIGH SURF ADVISORY          0        0
## 2         COASTAL FLOOD          0        0
## 3           FLASH FLOOD          0        0
## 4             LIGHTNING          0        0
## 5             TSTM WIND          0        0
## 6       TSTM WIND (G45)          0        0
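
The cbind() step above relies on both aggregates listing the event types in the same order. As an alternative sketch, both indicators can be summed in a single aggregate() call, which keeps each row's totals tied to its event type. The toy data frame here is hypothetical and only mimics the relevant columns.

```r
# Hypothetical miniature of the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(1, 3, 2, 0, 1),
  INJURIES   = c(15, 0, 2, 4, 0)
)

# cbind() on the left-hand side sums both columns per event type at once
health_toy <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)
colnames(health_toy) <- c("Event", "Fatalities", "Injuries")
health_toy
```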

Great - there are 985 different types of weather events recorded in this data set, matching the 985 rows in each summary. Just a note: these are the total injuries and fatalities attributed to each event type, not the average number per occurrence. We have chosen to analyze the data this way in order to see the total impact.
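
To illustrate the difference between totals and per-occurrence averages, here is a small sketch on hypothetical data. A frequent event can have a high total and a low average, while a rare event can show the reverse.

```r
# Hypothetical data: three tornado records, one heat record
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2, 3)
)

# Total fatalities per event type (the approach used in this analysis)
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Average fatalities per recorded occurrence (not used here, shown for contrast)
means <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = mean)

totals
means
```

In this toy case both event types have the same total, yet the average per occurrence differs by a factor of three.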

Because it is difficult to compare a fatality to an injury in this context, we will look for the top 5 events based on fatalities, and compare to the top 5 events based on injuries:

library(plyr)
byfatal <- arrange(healthimpact, desc(Fatalities))
head(byfatal, 5)
##            Event Fatalities Injuries
## 1        TORNADO       5633    91346
## 2 EXCESSIVE HEAT       1903     6525
## 3    FLASH FLOOD        978     1777
## 4           HEAT        937     2100
## 5      LIGHTNING        816     5230
byinjure <- arrange(healthimpact, desc(Injuries))
head(byinjure, 5)
##            Event Fatalities Injuries
## 1        TORNADO       5633    91346
## 2      TSTM WIND        504     6957
## 3          FLOOD        470     6789
## 4 EXCESSIVE HEAT       1903     6525
## 5      LIGHTNING        816     5230

Tornado, Excessive Heat, and Lightning appear on both lists, giving us 7 unique weather events. Let’s look more closely:

topevents <- rbind(byfatal[1:5,], byinjure[2:3,])
topevents
##             Event Fatalities Injuries
## 1         TORNADO       5633    91346
## 2  EXCESSIVE HEAT       1903     6525
## 3     FLASH FLOOD        978     1777
## 4            HEAT        937     2100
## 5       LIGHTNING        816     5230
## 21      TSTM WIND        504     6957
## 31          FLOOD        470     6789
qplot(Fatalities, Injuries, data=topevents, colour=Event, main="Injuries and Fatalities by Weather Event, Top 5")

Far and away, Tornados are the weather event with the most severe impact on population health in the United States.
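
The combined list above was built by picking rows 2 and 3 of the injuries ranking by hand. A sketch of deriving the same kind of union programmatically, using hypothetical data and a hypothetical cutoff top_n, might look like this:

```r
# Hypothetical health impact summary
toy <- data.frame(
  Event      = c("A", "B", "C", "D"),
  Fatalities = c(10, 5, 1, 0),
  Injuries   = c(100, 2, 50, 7)
)
top_n <- 2  # illustrative cutoff; the analysis above uses 5

# Rank by each indicator separately
top_fatal  <- head(toy[order(-toy$Fatalities), ], top_n)
top_injure <- head(toy[order(-toy$Injuries), ], top_n)

# Stack both lists and drop events that appear in both
top_union <- unique(rbind(top_fatal, top_injure))
top_union
```

This avoids hard-coded row indices, so the result stays correct if the rankings change.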

Results 2: Events With Greatest Economic Impact

Economic Impact is stated in this data set as both Property Damage and Crop Damage. Fortunately, since both are dollar amounts, it will be simple to sum them and determine the total financial impact of each weather event. In this data set, however, each amount is stored as a number plus a magnitude code: “K” for thousands of dollars, “M” for millions, or “B” for billions of dollars - we’ll need to convert before we continue. Here’s an example of how damages are recorded in this data set:

head(stormdata[,25:28],10)
##    PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1     25.0          K       0           
## 2      2.5          K       0           
## 3     25.0          K       0           
## 4      2.5          K       0           
## 5      2.5          K       0           
## 6      2.5          K       0           
## 7      2.5          K       0           
## 8      2.5          K       0           
## 9     25.0          K       0           
## 10    25.0          K       0

While the conversion is simple in theory, an aggregate by magnitude code reveals that it may not be as easy as we thought:

propdamage <- aggregate(PROPDMG ~ PROPDMGEXP, data=stormdata, FUN=sum)
propdamage
##    PROPDMGEXP     PROPDMG
## 1                  527.41
## 2           -       15.00
## 3           ?        0.00
## 4           +      117.00
## 5           0     7108.30
## 6           1        0.00
## 7           2       12.00
## 8           3       20.00
## 9           4       14.50
## 10          5      210.50
## 11          6       65.00
## 12          7       82.00
## 13          8        0.00
## 14          B      275.85
## 15          h        2.00
## 16          H       25.00
## 17          K 10735292.10
## 18          m       38.90
## 19          M   140694.45
cropdamage <- aggregate(CROPDMG ~ CROPDMGEXP, data=stormdata, FUN=sum)
cropdamage
##   CROPDMGEXP    CROPDMG
## 1                 11.00
## 2          ?       0.00
## 3          0     260.00
## 4          2       0.00
## 5          B      13.61
## 6          k     436.00
## 7          K 1342955.91
## 8          m      10.00
## 9          M   34140.80

Clearly, we have some unclean data here. Fortunately, the amounts recorded under the stray codes are small compared to those under the EXP values we care about: B, m, M, k and K. Let’s convert each line of the original data set to thousands of dollars and sum property and crop damage (K values need no transformation; rows with any other code are left unchanged, which treats them as thousands):

stormdata = within(stormdata, {
     prop1 = ifelse(PROPDMGEXP == "B", PROPDMG*1000000, ifelse(PROPDMGEXP == "m", PROPDMG*1000, ifelse(PROPDMGEXP == "M", PROPDMG*1000, PROPDMG)))
     crop1 = ifelse(CROPDMGEXP == "B", CROPDMG*1000000, ifelse(CROPDMGEXP == "m", CROPDMG*1000, ifelse(CROPDMGEXP == "M", CROPDMG*1000, CROPDMG)))
     economic = prop1 + crop1
      })
summary(stormdata$economic)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 5.28e+02 1.00e+00 1.15e+08
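
The nested ifelse() calls above can also be expressed as a lookup table of multipliers, which may be easier to read and extend. This is only a sketch on hypothetical rows; the name mult and the toy values are illustrative, and unknown codes are left unchanged to mirror the behavior of the code above.

```r
# Hypothetical damage records
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "M", "B", "m", ""),
  stringsAsFactors = FALSE
)

# Multipliers relative to thousands of dollars
mult <- c(K = 1, k = 1, M = 1e3, m = 1e3, B = 1e6)

# Look up each code; codes not in the table (including "") come back as NA
factor_k <- unname(mult[as.character(toy$PROPDMGEXP)])

# Assumption: treat unknown codes as thousands, as the analysis above does
factor_k[is.na(factor_k)] <- 1

toy$prop_k <- toy$PROPDMG * factor_k
toy
```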

Great - we can deduce from the summary on our computed “economic” variable that a small number of events are causing the vast majority of economic impact. Let’s take a closer look to see if that is correct:

econ <- aggregate(economic ~ EVTYPE, data=stormdata, FUN=sum)
econimpact <- arrange(econ, desc(economic))
head(econimpact, 10)
##               EVTYPE  economic
## 1              FLOOD 150319685
## 2  HURRICANE/TYPHOON  71913713
## 3            TORNADO  57352572
## 4        STORM SURGE  43323541
## 5               HAIL  18758572
## 6        FLASH FLOOD  17562686
## 7            DROUGHT  15018672
## 8          HURRICANE  14610229
## 9        RIVER FLOOD  10148404
## 10         ICE STORM   8967091
tot <- sum(econimpact$economic)
fiveworst <- sum(econimpact[1:5, 2])
flood <- econimpact[1,2]
flood/tot
## [1] 0.3155118
fiveworst/tot
## [1] 0.7171403
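
The two ratios above are single points on a cumulative share curve. As a sketch with hypothetical totals, cumsum() gives the share covered by the top 1, top 2, and so on in one step:

```r
# Hypothetical damage totals per event type (in thousands)
toy <- data.frame(
  EVTYPE   = c("HAIL", "FLOOD", "OTHER", "TORNADO"),
  economic = c(15, 50, 5, 30)
)

# Sort from most to least damaging
toy <- toy[order(-toy$economic), ]

# Share of the overall total, and running share of the top-k events
toy$share     <- toy$economic / sum(toy$economic)
toy$cum_share <- cumsum(toy$share)
toy
```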

31.6% of all weather-related damage in the United States as recorded by this data set was caused by Floods, and 71.7% by the top 5 most damaging events. Interesting! Let’s graph the 5 worst offenders:

top5 <- econimpact[1:5,]
qplot(EVTYPE, data=top5, weight=economic, geom="bar", ylab="Cost, in thousands", main="Economic Impact of Weather Events, Top 5")
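
Since qplot() is deprecated in recent ggplot2 releases (3.4.0 and later), the same kind of bar chart can be sketched with ggplot() and geom_col(). The data frame below uses hypothetical, rounded values rather than the real totals, and ordering the bars by cost is an added design choice, not part of the original plot.

```r
library(ggplot2)

# Hypothetical stand-in for the top5 data frame
top5_toy <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO"),
  economic = c(150, 72, 57)
)

# geom_col() draws bar heights directly from the y values;
# reorder() sorts the bars from most to least costly
p <- ggplot(top5_toy, aes(x = reorder(EVTYPE, -economic), y = economic)) +
  geom_col() +
  labs(x = "Event type",
       y = "Cost, in thousands",
       title = "Economic Impact of Weather Events (toy data)")
p
```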

Conclusions

Based on the data analysis performed here with the Storm Data weather set, we conclude:
1. Tornados are the #1 cause of weather-related fatalities and injuries in the US. They are the #3 most expensive weather event.
2. Floods are the #1 cause of weather-related damage and expenses in the US. They are the #3 cause of injuries, and Flash Floods (recorded as a separate event type) are the #3 cause of fatalities.