Synopsis

In this report we describe the types of severe weather events that were most harmful to population health and had the greatest economic consequences in the United States between 1950 and November 2011. There is no overall hypothesis; this is a descriptive analysis. We use all available storm data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database.

Loading and Processing the Raw Data

The data for this research are publicly available; the download URL appears in the code below.

Some documentation of the database is also available. It describes how some of the variables are constructed or defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

I use the "curl" download method because the analysis runs on OS X:

sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] formatR_1.2     tools_3.2.0     htmltools_0.2.6 yaml_2.1.13    
## [5] rmarkdown_0.5.1 knitr_1.10      stringr_0.6.2   digest_0.6.8   
## [9] evaluate_0.7

Download the file and read the raw data:

fileLocation <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fileDest <- "stormData.csv.bz2"
download.file(fileLocation, fileDest, method = "curl")
stormsData <- read.csv(bzfile(fileDest))
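
Because the archive is large, re-downloading it on every run is wasteful. The following is a minimal sketch of a guard that downloads only when the local copy is missing. The helper name fetchIfMissing is hypothetical and not part of the original analysis, and the default method "curl" is only carried over from the OS X setup described above.

```r
# Hypothetical helper (not in the original analysis): download only when
# the destination file does not exist yet, so repeated runs reuse the copy.
fetchIfMissing <- function(url, dest, method = "curl") {
  if (!file.exists(dest)) {
    download.file(url, dest, method = method)
  }
  # Return the local path invisibly so the call can feed bzfile()
  invisible(dest)
}
```

With this guard, the read step could become read.csv(bzfile(fetchIfMissing(fileLocation, fileDest))).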

As the object size below shows, the dataset is large:

format(object.size(stormsData), units = "Mb")
## [1] "409.4 Mb"
head(stormsData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

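To calculate the damage amounts we need to normalize the values: PROPDMG and CROPDMG are scaled by the exponent codes in PROPDMGEXP and CROPDMGEXP (K for thousands, M for millions, B for billions):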

normalizeValue <- function(power, value) {
  power <- toupper(power)
  if (power == "K") {return(value*1000)}
  if (power == "M") {return(value*1000000)}
  if (power == "B") {return(value*1000000000)}
  return(value)
}

stormsData$PROPDMG_VALUE <- normalizeValue(stormsData$PROPDMGEXP, stormsData$PROPDMG)
stormsData$CROPDMG_VALUE <- normalizeValue(stormsData$CROPDMGEXP, stormsData$CROPDMG)
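
A note on the function above: R's `if` is not vectorized. When `power` is a whole column, R 3.2.0 (the version used here) tests only its first element and issues a warning, while R 4.2.0 and later raise an error instead. Since the first row has PROPDMGEXP equal to "K" and a blank CROPDMGEXP, the calls above multiply every PROPDMG value by 1000 and leave every CROPDMG value unchanged, whatever each row's own exponent is. The following is a minimal vectorized sketch, assuming that codes other than K, M and B should leave the value unchanged, as in the original function. The name normalizeValueVec is hypothetical and not part of the original analysis.

```r
# Hypothetical vectorized variant of normalizeValue (name is illustrative).
normalizeValueVec <- function(power, value) {
  # Named lookup table: exponent code -> multiplier
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  # as.character() handles factor columns; toupper() treats "k" like "K"
  mult <- multipliers[toupper(as.character(power))]
  # Codes absent from the table (blank, digits, "?", ...) index to NA;
  # assumption: such rows keep their value unchanged (multiplier 1)
  mult[is.na(mult)] <- 1
  value * unname(mult)
}

# Small check on made-up inputs
normalizeValueVec(c("K", "m", "", "B"), c(25, 2.5, 3, 1))
```

Applying this variant to the full dataset can be expected to change the damage totals reported in the economic section of this report, which were produced with the original function.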

Results

In this section I try to answer two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The types of events most harmful with respect to population health

Calculate the total number of fatalities per weather event type:

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
fatalitiesByEvent <- aggregate(FATALITIES ~ EVTYPE, stormsData, sum)
fatalitiesByEvent <- filter(fatalitiesByEvent, FATALITIES > 0)
fatalitiesByEvent <- arrange(fatalitiesByEvent, -FATALITIES)

Only the first 20 event types have more than 100 fatalities over the whole period:

fatalitiesByEvent[1:20,]
##                     EVTYPE FATALITIES
## 1                  TORNADO       5633
## 2           EXCESSIVE HEAT       1903
## 3              FLASH FLOOD        978
## 4                     HEAT        937
## 5                LIGHTNING        816
## 6                TSTM WIND        504
## 7                    FLOOD        470
## 8              RIP CURRENT        368
## 9                HIGH WIND        248
## 10               AVALANCHE        224
## 11            WINTER STORM        206
## 12            RIP CURRENTS        204
## 13               HEAT WAVE        172
## 14            EXTREME COLD        160
## 15       THUNDERSTORM WIND        133
## 16              HEAVY SNOW        127
## 17 EXTREME COLD/WIND CHILL        125
## 18             STRONG WIND        103
## 19                BLIZZARD        101
## 20               HIGH SURF        101
par(mar=c(5.1,6.1,4.1,2.1))
barplot(fatalitiesByEvent[10:1, "FATALITIES"], 
        horiz = TRUE, 
        las = 1, 
        names.arg = fatalitiesByEvent[10:1, "EVTYPE"], 
        cex.names = 0.5, 
        main="Top 10 fatality causing events",
        xlab="Number of fatalities")

Calculate the total number of injuries per weather event type:

injuriesByEvent <- aggregate(INJURIES ~ EVTYPE, stormsData, sum)
injuriesByEvent <- filter(injuriesByEvent, INJURIES > 0)
injuriesByEvent <- arrange(injuriesByEvent, -INJURIES)

Only the first 14 event types have more than 1000 injuries over the whole period:

injuriesByEvent[1:14,]
##               EVTYPE INJURIES
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361
## 11      WINTER STORM     1321
## 12 HURRICANE/TYPHOON     1275
## 13         HIGH WIND     1137
## 14        HEAVY SNOW     1021
par(mar=c(5.1,6.1,4.1,2.1))
barplot(injuriesByEvent[10:1, "INJURIES"], 
        horiz = TRUE, 
        las = 1, 
        names.arg = injuriesByEvent[10:1, "EVTYPE"], 
        cex.names = 0.6, 
        main="Top 10 injury causing events",
        xlab="Number of injuries")

As the tables and plots show, tornadoes cause by far the most fatalities and injuries.
There is a very strong correlation between the fatality and injury totals per event type:

comparison <- merge(fatalitiesByEvent, injuriesByEvent)
cor(comparison$FATALITIES, comparison$INJURIES)
## [1] 0.9453787
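
With its default arguments, merge() performs an inner join on the shared column (here EVTYPE), so the correlation above covers only event types present in both filtered tables, that is, those with at least one fatality and at least one injury. The sketch below illustrates this on toy data that is invented purely for illustration and not taken from the storm database. It also shows the outer-join alternative with all = TRUE.

```r
# Toy data, invented for illustration only
fatal <- data.frame(EVTYPE = c("A", "B", "C"), FATALITIES = c(10, 5, 1))
injur <- data.frame(EVTYPE = c("A", "B", "D"), INJURIES = c(100, 40, 7))

# Default merge(): inner join, keeps only event types found in both tables
both <- merge(fatal, injur)
nrow(both)  # 2 rows: "A" and "B"

# all = TRUE keeps unmatched event types and fills the gaps with NA
full <- merge(fatal, injur, all = TRUE)
# Treat a missing total as zero before correlating
full$FATALITIES[is.na(full$FATALITIES)] <- 0
full$INJURIES[is.na(full$INJURIES)] <- 0
cor(full$FATALITIES, full$INJURIES)
```

Which join is appropriate depends on whether event types with zero fatalities or zero injuries should count toward the correlation.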

The types of events that have the greatest economic consequences

For the economic impact I use the same type of analysis as in the previous section.

stormsData <- mutate(stormsData, CONSEQUENCES = PROPDMG_VALUE + CROPDMG_VALUE)
consequenceByEvent <- aggregate(CONSEQUENCES ~ EVTYPE, stormsData, sum)
consequenceByEvent <- filter(consequenceByEvent, CONSEQUENCES > 0)
consequenceByEvent <- arrange(consequenceByEvent, -CONSEQUENCES)
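
The aggregate, filter and arrange steps are the same for fatalities, injuries and economic consequences, so they could be wrapped in one function. The following is a minimal sketch in base R. The name totalsByEvent and its signature are hypothetical and not part of the original analysis, and it assumes the data frame has an EVTYPE column.

```r
# Hypothetical helper (name and signature are illustrative)
totalsByEvent <- function(data, column) {
  # Build the formula `<column> ~ EVTYPE` from the column name
  f <- reformulate("EVTYPE", response = column)
  totals <- aggregate(f, data, sum)
  # Drop event types whose total is zero
  totals <- totals[totals[[column]] > 0, ]
  # Sort from the largest total to the smallest
  totals <- totals[order(-totals[[column]]), ]
  rownames(totals) <- NULL
  totals
}

# Toy data, invented for illustration only
toy <- data.frame(EVTYPE = c("A", "B", "A", "C"), FATALITIES = c(1, 5, 2, 0))
res <- totalsByEvent(toy, "FATALITIES")
```

Under these assumptions, totalsByEvent(stormsData, "CONSEQUENCES") would play the role of the three statements above.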

The top 20 event types by economic consequences:

head(consequenceByEvent, n = 20)
##                  EVTYPE CONSEQUENCES
## 1               TORNADO   3212358179
## 2           FLASH FLOOD   1420303790
## 3             TSTM WIND   1336074813
## 4                 FLOOD    900106518
## 5     THUNDERSTORM WIND    876910961
## 6                  HAIL    689272976
## 7             LIGHTNING    603355361
## 8    THUNDERSTORM WINDS    446311865
## 9             HIGH WIND    324748843
## 10         WINTER STORM    132722569
## 11           HEAVY SNOW    122254156
## 12             WILDFIRE     84463704
## 13            ICE STORM     66002359
## 14          STRONG WIND     62995427
## 15           HIGH WINDS     55626760
## 16           HEAVY RAIN     50853263
## 17       TROPICAL STORM     48429579
## 18     WILD/FOREST FIRE     39349140
## 19       FLASH FLOODING     28502276
## 20 URBAN/SML STREAM FLD     26054734
par(mar=c(5.1,6.1,4.1,2.1))
barplot((consequenceByEvent[10:1, "CONSEQUENCES"])/1000000,
        horiz = TRUE, 
        names.arg = consequenceByEvent[10:1, "EVTYPE"], 
        cex.names = 0.5, 
        las = 1, 
        xlab = "Total Damage (in millions $)", 
        main = "Top 10 economic damage causing events")

As we can see, tornadoes are again in first place.