Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This data analysis addresses the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

Download the data from the Coursera Data Science course site and read it into a data frame:

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "storms.csv.bz2", method="curl")
storms <- read.csv("storms.csv.bz2", header=TRUE)
str(storms)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
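
Because the compressed file is large, re-running the analysis need not download it again. The following is a minimal sketch of that idea; `fetchOnce` is a hypothetical helper name, not part of the original analysis, and it only wraps the base R functions `file.exists()` and `download.file()`.

```r
# Hypothetical helper (an assumption, not from the original report):
# download a file only when no local copy exists yet.
fetchOnce <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file byte-for-byte intact on all platforms
    download.file(url, destfile, mode = "wb")
  }
  # Return the local path invisibly so the call can be wrapped in read.csv()
  invisible(destfile)
}
```

Under these assumptions, `read.csv(fetchOnce(url, "storms.csv.bz2"))` with `url` set to the address above would reuse the local copy on later runs.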

Remove data from before 1996. According to the NCDC documentation, this older data is of lower quality.

storms$END_DATE = as.Date(storms$END_DATE, "%m/%d/%Y")
storms = subset(storms, END_DATE >= "1996-01-01")
dim(storms)
## [1] 653529     37
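
The date filter can be illustrated on a small example. This is a sketch on made-up data, assuming the same "month/day/year" prefix seen in END_DATE; the `toy` data frame and its values are hypothetical. It shows that the trailing time is ignored by the format and that rows with an empty END_DATE become NA and are dropped by `subset()`.

```r
# Toy data (hypothetical values) mimicking the END_DATE format in the storm data
toy <- data.frame(
  END_DATE = c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "", "4/27/2011 0:00:00"),
  INJURIES = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# as.Date() parses only the part matched by the format, so the trailing
# " 0:00:00" is ignored; the empty string cannot be parsed and becomes NA
toy$END_DATE <- as.Date(toy$END_DATE, format = "%m/%d/%Y")

# subset() keeps rows where the condition is TRUE, dropping FALSE and NA rows
recent <- subset(toy, END_DATE >= as.Date("1996-01-01"))
print(recent)
```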

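The following variables are important for this analysis: EVTYPE (event type), FATALITIES, INJURIES, PROPDMG and PROPDMGEXP (property damage and its multiplier), and CROPDMG and CROPDMGEXP (crop damage and its multiplier).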

Put all event types to lowercase:

storms$EVTYPE = tolower(storms$EVTYPE)

Remove the observations that report no property damage, crop damage, fatalities, or injuries:

storms = subset(storms, storms$CROPDMG != 0 | storms$PROPDMG != 0 | storms$FATALITIES != 0 | storms$INJURIES != 0)
dim(storms)
## [1] 201318     37

The NWS documentation describes the *DMGEXP values as empty, “K”, “M”, or “B”, representing multipliers (none, thousands, millions, billions) for the *DMG variables.

Check which *DMGEXP values actually occur in the remaining data:

table(storms$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
##   8448      0      0      0      0      0      0      0      0      0 
##      6      7      8      B      h      H      K      m      M 
##      0      0      0     32      0      0 185474      0   7364
table(storms$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 102767      0      0      0      2      0  96787      0   1762

Calculate cost in billion dollars using the *DMG and *DMGEXP variables for property damage and crop damage:

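# Property damage cost in billion USD, using the PROPDMG and PROPDMGEXP variables.
# Both sides of each assignment use the same row selection, so every value
# lands in its own row instead of being recycled from the full-length vector.
storms$PROPDMGCost = NA
storms$PROPDMGCost[storms$PROPDMGEXP == ""] = storms$PROPDMG[storms$PROPDMGEXP == ""] / 10^9
storms$PROPDMGCost[storms$PROPDMGEXP == "K"] = storms$PROPDMG[storms$PROPDMGEXP == "K"] / 10^6
storms$PROPDMGCost[storms$PROPDMGEXP == "M"] = storms$PROPDMG[storms$PROPDMGEXP == "M"] / 10^3
storms$PROPDMGCost[storms$PROPDMGEXP == "B"] = storms$PROPDMG[storms$PROPDMGEXP == "B"]

# Crop damage cost in billion USD, using the CROPDMG and CROPDMGEXP variables
storms$CROPDMGCost = NA
storms$CROPDMGCost[storms$CROPDMGEXP == ""] = storms$CROPDMG[storms$CROPDMGEXP == ""] / 10^9
storms$CROPDMGCost[storms$CROPDMGEXP == "K"] = storms$CROPDMG[storms$CROPDMGEXP == "K"] / 10^6
storms$CROPDMGCost[storms$CROPDMGEXP == "M"] = storms$CROPDMG[storms$CROPDMGEXP == "M"] / 10^3
storms$CROPDMGCost[storms$CROPDMGEXP == "B"] = storms$CROPDMG[storms$CROPDMGEXP == "B"]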
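
The multiplier conversion can also be written as a single lookup instead of one assignment per code. The following is a minimal sketch on made-up data, not the code used for the results in this report; the names `toy`, `toBillions`, and `key` are hypothetical, and only the four codes found in the filtered data (empty, "K", "M", "B") are assumed to occur.

```r
# Toy data (hypothetical values) with the four multiplier codes kept in the analysis
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 3, 500),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Named lookup: multiplier code -> factor that converts the amount to billion USD.
# "none" stands in for the empty code, because an empty string cannot be
# used to select an element of a vector by name.
toBillions <- c(none = 1e-9, K = 1e-6, M = 1e-3, B = 1)

# as.character() guards against factor columns; empty codes map to "none"
code <- as.character(toy$PROPDMGEXP)
key  <- ifelse(code == "", "none", code)

# Look up the factor for every row at once and scale the amounts
toy$PROPDMGCost <- toy$PROPDMG * unname(toBillions[key])
print(toy)
```

Any code missing from the lookup would yield NA, which makes unexpected multiplier values easy to spot before summing.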

Total and summarise the damage cost (in billion USD) for both property and crop damage:

sum(storms$PROPDMGCost)
## [1] 2427.098
summary(storms$PROPDMGCost)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0121   0.0000 595.0000
sum(storms$CROPDMGCost)
## [1] 44.09622
summary(storms$CROPDMGCost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.0e+00 0.0e+00 0.0e+00 2.2e-04 0.0e+00 3.8e+01

Results

# Use qcc for pareto charts
library(qcc)
## Package 'qcc', version 2.6
## Type 'citation("qcc")' for citing this R package in publications.

Event Types with the Greatest Economic Consequences

Based on the totals above, reported property damage is about 55 times higher than reported crop damage (2,427 versus 44 billion USD).

Property Damage

Top 10 event types with highest property damage cost (in billion dollars):

top10prop = head(sort(tapply(storms$PROPDMGCost, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10prop.df = data.frame(eventType = names(top10prop), propDamage = round(top10prop), row.names = NULL)
top10prop.df
##            eventType propDamage
## 1              flood        674
## 2   storm surge/tide        595
## 3        flash flood        456
## 4          hurricane        258
## 5  hurricane/typhoon        142
## 6          high wind         92
## 7            tornado         69
## 8               hail         32
## 9          tstm wind         25
## 10          wildfire         18
pareto.chart(top10prop, main="US storm event types with greatest economic consequences (1996-2011)", ylab = "Property Damage (billion USD)")

##                    
## Pareto chart analysis for top10prop
##                     Frequency Cum.Freq. Percentage Cum.Percent.
##   flood             673.91126  673.9113 28.5418426     28.54184
##   storm surge/tide  595.20319 1269.1144 25.2083574     53.75020
##   flash flood       456.22914 1725.3436 19.3224554     73.07266
##   hurricane         257.66367 1983.0073 10.9127068     83.98536
##   hurricane/typhoon 142.16847 2125.1757  6.0211935     90.00656
##   high wind          91.73748 2216.9132  3.8853136     93.89187
##   tornado            68.79707 2285.7103  2.9137294     96.80560
##   hail               31.97069 2317.6810  1.3540396     98.15964
##   tstm wind          25.45813 2343.1391  1.0782159     99.23785
##   wildfire           17.99529 2361.1344  0.7621459    100.00000
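
The columns of the Pareto analysis can be reproduced by hand. The sketch below uses the three largest rounded totals from the property damage table; the variable names are hypothetical. It also shows why the last row reaches 100 percent: the percentages are relative to the sum of the values passed in (here three values, in the chart the top 10), not to the total over all event types.

```r
# Rounded property damage totals (billion USD) for the three largest event types
damage <- c(flood = 674, "storm surge/tide" = 595, "flash flood" = 456)

# A Pareto analysis sorts in decreasing order and accumulates
damage    <- sort(damage, decreasing = TRUE)
cumDamage <- cumsum(damage)

# Percentages are taken relative to the sum of the supplied values only
pct    <- 100 * damage / sum(damage)
cumPct <- 100 * cumDamage / sum(damage)

print(data.frame(damage, cumDamage, pct = round(pct, 2), cumPct = round(cumPct, 2)))
```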

Crop Damage

Top 10 event types with highest crop damage cost (in billion dollars):

top10crop = head(sort(tapply(storms$CROPDMGCost, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10crop.df = data.frame(eventType = names(top10crop), cropDamage = round(top10crop, 2), row.names = NULL)
top10crop.df
##            eventType cropDamage
## 1  hurricane/typhoon      38.01
## 2               hail       1.91
## 3              flood       1.16
## 4          tstm wind       0.84
## 5            drought       0.78
## 6          high wind       0.28
## 7  thunderstorm wind       0.28
## 8        flash flood       0.22
## 9       extreme cold       0.15
## 10    tropical storm       0.10
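
The top 10 tables in this report all follow the same pattern: sum a variable per event type with `tapply()`, sort in decreasing order, and keep the first rows with `head()`. A minimal sketch of that pattern on made-up data (the `toy` data frame and its values are hypothetical):

```r
# Toy data (hypothetical values): one row per event
toy <- data.frame(
  EVTYPE = c("flood", "hail", "flood", "tornado", "hail", "flood"),
  cost   = c(2, 1, 3, 10, 0.5, 1),
  stringsAsFactors = FALSE
)

# tapply() splits cost by EVTYPE and sums each group,
# returning a vector named by event type
totals <- tapply(toy$cost, toy$EVTYPE, sum)

# Sort in decreasing order and keep the two largest totals
top2 <- head(sort(totals, decreasing = TRUE), n = 2)
print(top2)
```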

Most Harmful Event Types with Respect to Population Health

Injuries are considered here as the main measure of harm to population health. Fatalities are presented second.

Injuries

Top 10 event types with highest number of injuries (in thousands):

top10injuries = head(sort(tapply(storms$INJURIES/1000, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10injuries.df = data.frame(eventType = names(top10injuries), injuries = top10injuries, row.names = NULL)
top10injuries.df
##            eventType injuries
## 1            tornado   20.667
## 2              flood    6.758
## 3     excessive heat    6.391
## 4          lightning    4.141
## 5          tstm wind    3.629
## 6        flash flood    1.674
## 7  thunderstorm wind    1.400
## 8       winter storm    1.292
## 9  hurricane/typhoon    1.275
## 10              heat    1.222
pareto.chart(top10injuries, main="US storm event types most harmful for population health (1996-2011)", ylab="injuries (thousands)")

##                    
## Pareto chart analysis for top10injuries
##                     Frequency Cum.Freq. Percentage Cum.Percent.
##   tornado              20.667    20.667  42.657227     42.65723
##   flood                 6.758    27.425  13.948688     56.60592
##   excessive heat        6.391    33.816  13.191191     69.79711
##   lightning             4.141    37.957   8.547132     78.34424
##   tstm wind             3.629    41.586   7.490351     85.83459
##   flash flood           1.674    43.260   3.455180     89.28977
##   thunderstorm wind     1.400    44.660   2.889637     92.17941
##   winter storm          1.292    45.952   2.666722     94.84613
##   hurricane/typhoon     1.275    47.227   2.631633     97.47776
##   heat                  1.222    48.449   2.522240    100.00000

Fatalities

Top 10 event types with highest number of fatalities:

top10fatalities = head(sort(tapply(storms$FATALITIES, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10fatalities.df = data.frame(eventType = names(top10fatalities), fatalities = top10fatalities, row.names = NULL)
top10fatalities.df
##         eventType fatalities
## 1  excessive heat       1797
## 2         tornado       1511
## 3     flash flood        887
## 4       lightning        651
## 5           flood        414
## 6     rip current        340
## 7       tstm wind        241
## 8            heat        237
## 9       high wind        235
## 10      avalanche        223