This document is aimed to identify which types of storm and severe weather events are most harmful to public health and which types have the greatest economic consequences.
The analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including place, date and estimates of any fatalities, injuries and property damage.
Documentation of the data base is available:
1. NATIONAL WEATHER SERVICE INSTRUCTION 1-1605
2. Storm Data FAQ Page
Instruction 1-165, issued on August 17, 2007, contains the updated events type table among some other useful information.
The events in the database start in the year 1950 and end in November 2011. In recent years there are more events recorded and more complete.
The analysis conclusion is that Tornado is the most harmful and damaging event type. Resources should be allocated prioritary to prepare for this event type.
Download the data file from the url provided.
if (!file.exists ("../files/StormData.bz2")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "../files/StormData.bz2", method = "wget")
}
The data come as a comma-separated-value file compressed via the bzip2 algorithm. It can be read as a csv file.
if (!exists("ds")){
ds <- read.csv("../files/StormData.bz2", stringsAsFactors = FALSE, na.strings = "")
}
Check the dataset dim and vars.
dim(ds)
## [1] 902297 37
str(ds)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Convert BGN_DATE to date.
ds$begin.date <- as.Date(ds$BGN_DATE, "%m/%d/%Y")
summary(ds$begin.date)
## Min. 1st Qu. Median Mean 3rd Qu.
## "1950-01-03" "1995-04-20" "2002-03-18" "1998-12-27" "2007-07-28"
## Max.
## "2011-11-30"
Summary of event types.
str(as.factor(ds$EVTYPE))
## Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
summary(as.factor(ds$EVTYPE))
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE SNOW COLD/WIND CHILL
## 600 587 539
## FOG RIP CURRENT MARINE HAIL
## 538 470 442
## DUST STORM AVALANCHE WIND
## 427 386 340
## RIP CURRENTS STORM SURGE FREEZING RAIN
## 304 261 250
## URBAN FLOOD HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 249 228 204
## STRONG WINDS DRY MICROBURST ASTRONOMICAL LOW TIDE
## 196 186 174
## HURRICANE RIVER FLOOD LIGHT SNOW
## 174 173 154
## STORM SURGE/TIDE RECORD WARMTH COASTAL FLOODING
## 148 146 143
## DUST DEVIL MARINE HIGH WIND UNSEASONABLY WARM
## 141 135 126
## FLOODING ASTRONOMICAL HIGH TIDE MODERATE SNOWFALL
## 120 103 101
## URBAN FLOODING WINTRY MIX HURRICANE/TYPHOON
## 98 90 88
## FUNNEL CLOUDS HEAVY SURF RECORD HEAT
## 87 84 81
## FREEZE HEAT WAVE COLD
## 74 74 72
## RECORD COLD ICE THUNDERSTORM WINDS HAIL
## 64 61 61
## TROPICAL DEPRESSION SLEET UNSEASONABLY DRY
## 60 59 56
## FROST GUSTY WINDS THUNDERSTORM WINDSS
## 53 53 51
## MARINE STRONG WIND OTHER SMALL HAIL
## 48 48 47
## FUNNEL FREEZING FOG THUNDERSTORM
## 46 45 45
## Temperature record TSTM WIND (G45) Coastal Flooding
## 43 39 38
## WATERSPOUTS MONTHLY PRECIPITATION WINDS
## 37 36 36
## (Other)
## 2940
There are many more events types in dataset than in published table. Some cleaning applied.
ds$evt <- toupper (ds$EVTYPE)
ds$evt <- gsub("^ *| *$", "", ds$evt)
ds$evt <- gsub(" {2,}", " ", ds$evt)
ds$evt <- gsub(" /|/ ", "", ds$evt)
ds$evt <- gsub("COLD$", "COLD/WIND CHILL", ds$evt)
ds$evt <- gsub("^URBAN", "FLOOD", ds$evt)
ds$evt <- gsub("^FLOOD(.)*", "FLOOD", ds$evt)
ds$evt <- gsub("^HEAT(.)+", "HEAT", ds$evt)
ds$evt <- gsub("^HIGH WIND(.)+", "HIGH WIND", ds$evt)
ds$evt <- gsub("^HURRICANE(.)*", "HURRICANE (TYPHOON)", ds$evt)
ds$evt <- gsub("^MARINE TSTM(.)*", "MARINE THUNDERSTORM WIND", ds$evt)
ds$evt <- gsub("^THUNDERSTORM(.)+", "THUNDERSTORM WIND", ds$evt)
ds$evt <- gsub("^TSTM WIND(.)*", "THUNDERSTORM WIND", ds$evt)
ds$evt <- gsub("^WILD/(.)+", "WILDFIRE", ds$evt)
ds$evt <- gsub("^WINTER WEATHER(.)+", "WINTER WEATHER", ds$evt)
Summary of event types after cleaning.
summary(as.factor(ds$evt))
## THUNDERSTORM WIND HAIL
## 324713 288661
## TORNADO FLASH FLOOD
## 60652 54278
## FLOOD HIGH WIND
## 29913 21802
## LIGHTNING HEAVY SNOW
## 15755 15708
## MARINE THUNDERSTORM WIND HEAVY RAIN
## 11987 11742
## WINTER STORM WINTER WEATHER
## 11433 8155
## FUNNEL CLOUD WILDFIRE
## 6844 4219
## WATERSPOUT STRONG WIND
## 3797 3569
## BLIZZARD DROUGHT
## 2719 2488
## ICE STORM EXCESSIVE HEAT
## 2006 1678
## EXTREME COLD/WIND CHILL FROST/FREEZE
## 1659 1343
## DENSE FOG HEAT
## 1293 848
## HIGH SURF TROPICAL STORM
## 734 690
## FLASH FLOODING COASTAL FLOOD
## 682 657
## LAKE-EFFECT SNOW COLD/WIND CHILL
## 636 621
## SNOW LANDSLIDE
## 617 600
## FOG RIP CURRENT
## 538 470
## MARINE HAIL DUST STORM
## 442 427
## AVALANCHE WIND
## 386 347
## RIP CURRENTS HURRICANE (TYPHOON)
## 304 288
## STORM SURGE FREEZING RAIN
## 261 260
## HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 228 204
## STRONG WINDS DRY MICROBURST
## 204 186
## COASTAL FLOODING LIGHT SNOW
## 183 176
## ASTRONOMICAL LOW TIDE RIVER FLOOD
## 174 173
## RECORD WARMTH DUST DEVIL
## 154 149
## STORM SURGE/TIDE MARINE HIGH WIND
## 148 135
## UNSEASONABLY WARM ASTRONOMICAL HIGH TIDE
## 126 103
## MODERATE SNOWFALL WINTRY MIX
## 101 94
## FUNNEL CLOUDS HEAVY SURF
## 87 87
## RECORD HEAT FREEZE
## 82 76
## RECORD COLD/WIND CHILL GUSTY WINDS
## 68 65
## ICE TROPICAL DEPRESSION
## 61 60
## SLEET FROST
## 59 57
## UNSEASONABLY DRY SMALL HAIL
## 56 53
## OTHER MARINE STRONG WIND
## 52 48
## FREEZING FOG FUNNEL
## 46 46
## THUNDERSTORM GLAZE
## 45 43
## TEMPERATURE RECORD MIXED PRECIPITATION
## 43 37
## WATERSPOUTS MONTHLY PRECIPITATION
## 37 36
## WINDS SNOW AND ICE
## 36 34
## FLASH FLOODS HEAVY SNOW SQUALLS
## 32 32
## ICY ROADS WIND DAMAGE
## 32 31
## HAIL 75 RIVER FLOODING
## 29 29
## HEAVY RAINS EXCESSIVE SNOW
## 26 25
## HEAVY LAKE SNOW TIDAL FLOODING
## 25 25
## FREEZING DRIZZLE GUSTY WIND
## 24 24
## LAKE EFFECT SNOW LAKESHORE FLOOD
## 23 23
## LIGHT FREEZING RAIN SEVERE THUNDERSTORMS
## 23 23
## UNSEASONABLY COLD/WIND CHILL (Other)
## 23 1774
Still too many event types. It seems to be sensible to select events with frequency > 0.1%
min_freq <- length(ds$evt)[1] * 0.001
evt_freq <- tapply(ds$evt, ds$evt, length)
events <- names(evt_freq[evt_freq > min_freq])
table(ds$evt[ds$evt %in% events])
##
## BLIZZARD DENSE FOG DROUGHT
## 2719 1293 2488
## EXCESSIVE HEAT EXTREME COLD/WIND CHILL FLASH FLOOD
## 1678 1659 54278
## FLOOD FROST/FREEZE FUNNEL CLOUD
## 29913 1343 6844
## HAIL HEAVY RAIN HEAVY SNOW
## 288661 11742 15708
## HIGH WIND ICE STORM LIGHTNING
## 21802 2006 15755
## MARINE THUNDERSTORM WIND STRONG WIND THUNDERSTORM WIND
## 11987 3569 324713
## TORNADO WATERSPOUT WILDFIRE
## 60652 3797 4219
## WINTER STORM WINTER WEATHER
## 11433 8155
Filter the dataset with events and select cols needed for analysis.
ds1 <- ds[ds$evt %in% events, c("begin.date", "evt", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
summary(ds1)
## begin.date evt FATALITIES
## Min. :1950-01-03 Length:886414 Min. : 0.00000
## 1st Qu.:1995-03-19 Class :character 1st Qu.: 0.00000
## Median :2002-03-29 Mode :character Median : 0.00000
## Mean :1998-12-05 Mean : 0.01362
## 3rd Qu.:2007-07-27 3rd Qu.: 0.00000
## Max. :2011-11-30 Max. :158.00000
## INJURIES PROPDMG PROPDMGEXP
## Min. : 0.0000 Min. : 0.00 Length:886414
## 1st Qu.: 0.0000 1st Qu.: 0.00 Class :character
## Median : 0.0000 Median : 0.00 Mode :character
## Mean : 0.1494 Mean : 11.95
## 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :1700.0000 Max. :5000.00
## CROPDMG CROPDMGEXP
## Min. : 0.000 Length:886414
## 1st Qu.: 0.000 Class :character
## Median : 0.000 Mode :character
## Mean : 1.506
## 3rd Qu.: 0.000
## Max. :990.000
Check magnitudes of damages.
table(ds1$PROPDMGEXP, useNA = "ifany")
##
## - ? + 0 1 2 3 4 5 6
## 1 8 4 215 23 13 4 4 28 4
## 7 8 B h H K m M <NA>
## 5 1 14 1 6 417736 6 10793 457548
table(ds1$CROPDMGEXP, useNA = "ifany")
##
## ? 0 2 B k K M <NA>
## 6 19 1 5 19 276613 1826 607925
Proportion of missing values is high. Check amounts with missing magnitude.
table(ds1$PROPDMG[is.na(ds1$PROPDMGEXP)])
##
## 0 1 2 3 4 5 6 7 8 9
## 457476 4 7 16 8 11 5 3 2 3
## 10 20 35 75
## 7 4 1 1
table(ds1$CROPDMG[is.na(ds1$CROPDMGEXP)])
##
## 0 3 4
## 607922 1 2
Most of them are 0.
Transform magnitues to figures. Million as unit.
ds1$pdmg[ds1$PROPDMGEXP %in% c("k", "K")] <- 0.001
ds1$pdmg[ds1$PROPDMGEXP %in% c("m", "M")] <- 1
ds1$pdmg[ds1$PROPDMGEXP %in% c("b", "B")] <- 1000
ds1$pdmg[is.na(ds1$pdmg)] <- 0
ds1$cdmg[ds1$CROPDMGEXP %in% c("k", "K")] <- 0.001
ds1$cdmg[ds1$CROPDMGEXP %in% c("m", "M")] <- 1
ds1$cdmg[ds1$CROPDMGEXP %in% c("b", "B")] <- 1000
ds1$cdmg[is.na(ds1$cdmg)] <- 0
Obtain same magnitude and total amount of damage per event.
ds1$pd <- ds1$PROPDMG * ds1$pdmg
summary(ds1$pd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 3.10e-01 0.00e+00 1.15e+05
ds1$cd <- ds1$CROPDMG * ds1$cdmg
summary(ds1$cd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0e+00 0.0e+00 0.0e+00 4.1e-02 0.0e+00 5.0e+03
ds1$td <- ds1$pd + ds1$cd
summary(ds1$td)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 3.50e-01 0.00e+00 1.15e+05
ds1$tif <- ds1$FATALITIES + ds1$INJURIES
head(ds1)
## begin.date evt FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 1950-04-18 TORNADO 0 15 25.0 K 0
## 2 1950-04-18 TORNADO 0 0 2.5 K 0
## 3 1951-02-20 TORNADO 0 2 25.0 K 0
## 4 1951-06-08 TORNADO 0 2 2.5 K 0
## 5 1951-11-15 TORNADO 0 2 2.5 K 0
## 6 1951-11-15 TORNADO 0 6 2.5 K 0
## CROPDMGEXP pdmg cdmg pd cd td tif
## 1 <NA> 0.001 0 0.0250 0 0.0250 15
## 2 <NA> 0.001 0 0.0025 0 0.0025 0
## 3 <NA> 0.001 0 0.0250 0 0.0250 2
## 4 <NA> 0.001 0 0.0025 0 0.0025 2
## 5 <NA> 0.001 0 0.0025 0 0.0025 2
## 6 <NA> 0.001 0 0.0025 0 0.0025 6
Make some plots to see results.
Filter events with fatalities and injuries.
ds2 <- ds1[ds1$tif > 0, c("begin.date", "evt", "FATALITIES", "INJURIES", "tif")]
Scatter plot.
qplot(evt, tif, data = ds2, log = "y", xlab = "Event", ylab = "log(People)",
main = "Fatalities and injuries by event type", geom = "jitter") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
The result is that Tornado is by far the most harmful event type in terms of fatalities and injuries.
Filter events with economic consequences.
ds3 <- ds1[ds1$td > 0, c("begin.date", "evt", "pd", "cd", "td")]
Scatter plot.
qplot(evt, td, data = ds3, log = "y", xlab = "Event", ylab = "log(US$ in millions)",
main = "Economic damage by event type", geom = "jitter", alpha = I(1/10)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
There is one outlier of type flood that distort the result by totals.
summary(ds3$td[ds3$evt == "FLOOD"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.01 0.05 13.35 0.30 115000.00
ds3[ds3$td == max(ds3$td), ]
## begin.date evt pd cd td
## 605953 2006-01-01 FLOOD 115000 32.5 115032.5
The result is that Tornado is the event type that causes greatest economic damage. Floods, hail and thunderstorm wind are also very damaging.