This document is an exploratory review of the NOAA Storm Database and addresses the following key questions:
The dataset used for analysis is from the NOAA strom tracking database
In addition there is a FAQ.
Load the data and parse it, this will download the compressed file and parse into R, if the file has already been downloaded locally it will skip that step.
Lets go ahead and dump out a summary of EVTYPE since this will best answer our question to categorize by event type.
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
localZipFile <- "StormData.bz2"
if (!file.exists(localZipFile)) {
download.file(URL, destfile=localZipFile)
}
# this bz2 jas 500 MB unzipped on disk, takes about 80s on my PC
system.time(stormdata <- read.csv(bzfile(localZipFile), header=TRUE))
## user system elapsed
## 78.58 0.30 79.05
# we know the date is easy to reformat
stormdata$date <- mdy_hms(stormdata$BGN_DATE)
summary(stormdata$EVTYPE)
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE SNOW COLD/WIND CHILL
## 600 587 539
## FOG RIP CURRENT MARINE HAIL
## 538 470 442
## DUST STORM AVALANCHE WIND
## 427 386 340
## RIP CURRENTS STORM SURGE FREEZING RAIN
## 304 261 250
## URBAN FLOOD HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 249 228 204
## STRONG WINDS DRY MICROBURST ASTRONOMICAL LOW TIDE
## 196 186 174
## HURRICANE RIVER FLOOD LIGHT SNOW
## 174 173 154
## STORM SURGE/TIDE RECORD WARMTH COASTAL FLOODING
## 148 146 143
## DUST DEVIL MARINE HIGH WIND UNSEASONABLY WARM
## 141 135 126
## FLOODING ASTRONOMICAL HIGH TIDE MODERATE SNOWFALL
## 120 103 101
## URBAN FLOODING WINTRY MIX HURRICANE/TYPHOON
## 98 90 88
## FUNNEL CLOUDS HEAVY SURF RECORD HEAT
## 87 84 81
## FREEZE HEAT WAVE COLD
## 74 74 72
## RECORD COLD ICE THUNDERSTORM WINDS HAIL
## 64 61 61
## TROPICAL DEPRESSION SLEET UNSEASONABLY DRY
## 60 59 56
## FROST GUSTY WINDS THUNDERSTORM WINDSS
## 53 53 51
## MARINE STRONG WIND OTHER SMALL HAIL
## 48 48 47
## FUNNEL FREEZING FOG THUNDERSTORM
## 46 45 45
## Temperature record TSTM WIND (G45) Coastal Flooding
## 43 39 38
## WATERSPOUTS MONTHLY PRECIPITATION WINDS
## 37 36 36
## (Other)
## 2940
We can see this will need to be normalized, since there are a great variety of levels for this attribute. The following code will go through and use grepl to match the patterns and create a new feature called catType to help normalize these event types.
#
# levelfactor(f, dict(code,pattern))
# returns factor that is simplified
levelfactor <- function(f, myDict) {
g <- as.character(f)
levels(g) <- c(levels(g), levels(myDict$code))
# unfortanely at this time I can think of no
# better way than loop through the rules
# ... at least it's hidden in this function
for(i in 1:nrow(myDict)) {
g[grepl(myDict[i,2], g, ignore.case=TRUE)] <- myDict[i,1]
}
as.factor(as.character(g))
}
# there's got to be a better way to do this
dict <- data.frame(code="RAIN", pattern="RAIN", stringsAsFactors=FALSE)
dict[nrow(dict)+1,] <- list(code="WIND", pattern="WIND|WND")
dict[nrow(dict)+1,] <- list(code="FLOOD", pattern="FLO?O?O?D")
dict[nrow(dict)+1,] <- list(code="FLOOD", pattern="TSUMANI|RISING WATER")
dict[nrow(dict)+1,] <- list(code="SNOW", pattern="SNOW|WINTER|WINTRY")
dict[nrow(dict)+1,] <- list(code="RAIN", pattern="SHOWER|MICROURST|BURST|HEAVY MIX")
dict[nrow(dict)+1,] <- list(code="FIRE", pattern="FIRE|SMOKE")
dict[nrow(dict)+1,] <- list(code="HEAT", pattern="HEAT|HIGH|HOT|WARM|EXCESSIVE")
dict[nrow(dict)+1,] <- list(code="COLD", pattern="COLD|FREEZ|FROST|LOW|COOL")
dict[nrow(dict)+1,] <- list(code="FOG", pattern="FOG")
dict[nrow(dict)+1,] <- list(code="HAIL", pattern="HAIL")
dict[nrow(dict)+1,] <- list(code="WATERSPOUT", pattern="SPOUT|FUNNEL")
dict[nrow(dict)+1,] <- list(code="HURRICANE", pattern="HURRICANE|TYPHOON")
dict[nrow(dict)+1,] <- list(code="TORNADO", pattern="TORNADO|TORNDAO")
dict[nrow(dict)+1,] <- list(code="FOG", pattern="FOG")
dict[nrow(dict)+1,] <- list(code="THUNDERSTORM", pattern="THUNDER|TSTM")
dict[nrow(dict)+1,] <- list(code="LIGHTNING", pattern="LIG[HN]TN?ING")
dict[nrow(dict)+1,] <- list(code="TIDE", pattern="TIDE|SURF|WAVE|RIP CURRENT")
dict[nrow(dict)+1,] <- list(code="ICE", pattern="IC[EY]")
dict[nrow(dict)+1,] <- list(code="TEMPERATURE", pattern="TEMPERATURE|EXPOSURE")
dict[nrow(dict)+1,] <- list(code="DUST DEVIL", pattern="DUST DEVIL")
dict[nrow(dict)+1,] <- list(code="OTHER", pattern="OTHER|summary|NONE|[?]")
dict[nrow(dict)+1,] <- list(code="MARINE", pattern="MARINE|SEAS|DROWNING|SWELLS")
dict[nrow(dict)+1,] <- list(code="LANDSLIDE", pattern="SLIDE|LANDSLUMP")
dict[nrow(dict)+1,] <- list(code="PRECIPITATION", pattern="PRECIP")
dict[nrow(dict)+1,] <- list(code="SLEET", pattern="SLEET")
dict[nrow(dict)+1,] <- list(code="GLAZE", pattern="GLAZE")
dict[nrow(dict)+1,] <- list(code="DRY", pattern="DRY|DRIEST")
dict[nrow(dict)+1,] <- list(code="DUST", pattern="DUST")
dict[nrow(dict)+1,] <- list(code="TROPICAL STORM", pattern="TROPICAL STORM")
dict[nrow(dict)+1,] <- list(code="GUSTNADO", pattern="GUSTNADO")
dict[nrow(dict)+1,] <- list(code="VOLCANIC ASH", pattern="VOLCANIC")
dict[nrow(dict)+1,] <- list(code="WET", pattern="WET")
dict[nrow(dict)+1,] <- list(code="COASTAL STORM", pattern="COASTAL")
dict[nrow(dict)+1,] <- list(code="URBAN/SMALL STREAM", pattern="SMALL|STREAM")
dict[nrow(dict)+1,] <- list(code="BEACH EROSION", pattern="BEACH")
dict[nrow(dict)+1,] <- list(code="AVALANCHE", pattern="AVALANCH?E")
dict[nrow(dict)+1,] <- list(code="OTHER", pattern="Metro Storm|No Severe|NORTHERN LIGHTS|SOUTHEAST")
dict[nrow(dict)+1,] <- list(code="OTHER", pattern="APACHE COUNTY|MILD PATTERN|VOG")
dict[nrow(dict)+1,] <- list(code="DAM", pattern="DAM")
dict[nrow(dict)+1,] <- list(code="HURRICANE", pattern="REMNANTS OF FLOYD|RED FLAG")
dict[nrow(dict)+1,] <- list(code="CLOUD", pattern="CLOUD")
dict[nrow(dict)+1,] <- list(code="BLIZZARD", pattern="BLIZZARD")
dict[nrow(dict)+1,] <- list(code="WIND", pattern="TURBULENCE")
stormdata$catType <- levelfactor(stormdata$EVTYPE, dict)
summary(stormdata$catType)
## AVALANCHE BEACH EROSION BLIZZARD
## 387 5 2722
## CLOUD COASTAL STORM COLD
## 11 14 2690
## DAM DROUGHT DRY
## 5 2488 35
## DUST FIRE FLOOD
## 582 4261 86091
## FOG GLAZE GUSTNADO
## 1836 43 7
## HAIL HEAT HURRICANE
## 289276 4109 302
## ICE LANDSLIDE LIGHTNING
## 2131 648 15763
## MARINE OTHER PRECIPITATION
## 85 139 91
## RAIN SEICHE SLEET
## 12463 21 71
## SNOW STORM SURGE TEMPERATURE
## 37327 261 75
## THUNDERSTORM TIDE TORNADO
## 95 1018 60685
## TROPICAL DEPRESSION TROPICAL STORM TSUNAMI
## 60 697 20
## URBAN/SMALL STREAM VOLCANIC ASH WATERSPOUT
## 18 29 10845
## WET WIND
## 11 364880
As you can see, we still have a lot of categories, but this is greatly reduced, let’s look at the top 10 by incident count
stormdata %>%
group_by(catType) %>%
summarize(count=n()) %>%
arrange(desc(count))%>%
slice(1:10) %>%
ungroup()
## # A tibble: 10 x 2
## catType count
## <fctr> <int>
## 1 WIND 364880
## 2 HAIL 289276
## 3 FLOOD 86091
## 4 TORNADO 60685
## 5 SNOW 37327
## 6 LIGHTNING 15763
## 7 RAIN 12463
## 8 WATERSPOUT 10845
## 9 FIRE 4261
## 10 HEAT 4109
# go ahead and re-order the factors based on incident count
factorOrder <- group_by(stormdata, catType) %>%
summarize(count=n()) %>%
arrange(desc(count))
stormdata$catType <- factor(stormdata$catType, levels(stormdata$catType)[factorOrder$catType])
We can confirm the normalization with the following code, since there are still around 50 categories, we will only display the groups for the top 3 categories in this document
unique(droplevels(stormdata[stormdata$catType == "WIND",]$EVTYPE))
## [1] TSTM WIND HURRICANE OPAL/HIGH WINDS
## [3] THUNDERSTORM WINDS THUNDERSTORM WIND
## [5] HIGH WINDS THUNDERSTORM WINDS LIGHTNING
## [7] THUNDERSTORM WINDS/HAIL WIND
## [9] THUNDERSTORM WINDS HAIL HIGH WIND
## [11] WIND CHILL HIGH WIND/BLIZZARD
## [13] HIGH WIND AND HIGH TIDES HIGH WIND/BLIZZARD/FREEZING RA
## [15] HIGH WIND AND HEAVY SNOW RECORD COLD AND HIGH WIND
## [17] HIGH WIND/ BLIZZARD BLIZZARD/HIGH WIND
## [19] HIGH WIND/LOW WIND CHILL HIGH WINDS AND WIND CHILL
## [21] HEAVY SNOW/HIGH WINDS/FREEZING WIND CHILL/HIGH WIND
## [23] HIGH WIND/WIND CHILL/BLIZZARD HIGH WIND/WIND CHILL
## [25] HIGH WIND/HEAVY SNOW HIGH WIND/SEAS
## [27] SEVERE TURBULENCE HEAVY SNOW/WIND
## [29] WIND DAMAGE THUNDERSTORM WINDS/FUNNEL CLOU
## [31] WINTER STORM/HIGH WIND WINTER STORM/HIGH WINDS
## [33] GUSTY WINDS STRONG WINDS
## [35] SNOW AND WIND HIGH WINDS DUST STORM
## [37] WINTER STORM HIGH WINDS SEVERE THUNDERSTORM WINDS
## [39] THUNDERSTORMS WINDS WINDS
## [41] FLASH FLOOD WINDS STRONG WIND
## [43] HIGH WIND DAMAGE DOWNBURST WINDS
## [45] DRY MICROBURST WINDS DRY MIRCOBURST WINDS
## [47] MICROBURST WINDS HIGH WINDS 57
## [49] HIGH WINDS 66 HIGH WINDS 76
## [51] HIGH WINDS 63 HIGH WINDS 67
## [53] HEAVY SNOW/HIGH WINDS HIGH WINDS 82
## [55] HIGH WINDS 80 HIGH WINDS 58
## [57] LIGHTNING THUNDERSTORM WINDSS HIGH WINDS 73
## [59] HIGH WINDS 55 THUNDERSTORM WINDS 60
## [61] THUNDERSTORM WINDSS HIGH WINDS/FLOODING
## [63] TORNADOES, TSTM WIND, HAIL LIGHTNING THUNDERSTORM WINDS
## [65] THUNDERSTORM WINDS53 THUNDERSTORM WINDS 13
## [67] HEAVY SNOW/HIGH WIND HIGH WINDS/
## [69] EXTREME WIND CHILLS HIGH WINDS
## [71] EXTREME WIND CHILL GRADIENT WINDS
## [73] THUNDERSTORM WINDS URBAN FLOOD THUNDERSTORM WINDS SMALL STREA
## [75] BLOWING SNOW- EXTREME WIND CHI SNOW- HIGH WIND- WIND CHILL
## [77] THUNDERSTORM WINDS 2 TSTM WIND 51
## [79] TSTM WIND 50 TSTM WIND 52
## [81] TSTM WIND 55 THUNDERSTORM WINDS 61
## [83] THUNDERTORM WINDS HAIL/WINDS
## [85] WIND STORM HAIL/WIND
## [87] WIND/HAIL THUNDERSTORMS WIND
## [89] THUNDERSTORM WINDS TUNDERSTORM WIND
## [91] THUNDERTSORM WIND THUNDERSTORM WINDS/ HAIL
## [93] THUNDERSTORM WIND/LIGHTNING THUNDESTORM WINDS
## [95] HIGH WIND 63 HIGH WINDS/COASTAL FLOOD
## [97] THUNDERSTORM WIND G50 THUNDERSTROM WINDS
## [99] THUNDERSTORM WINDS LE CEN BLIZZARD AND EXTREME WIND CHIL
## [101] LOW WIND CHILL BLOWING SNOW & EXTREME WIND CH
## [103] THUNDERSTORM WINDS G DUST STORM/HIGH WINDS
## [105] THUNDERSTORM WIND G60 THUNDERSTORM WINDS.
## [107] THUNDERSTORM WIND G55 THUNDERSTORM WINDS G60
## [109] THUNDERSTORM WINDS FUNNEL CLOU THUNDERSTORM WINDS 62
## [111] HEAVY SNOW AND HIGH WINDS HEAVY SNOW/HIGH WINDS & FLOOD
## [113] THUNDERSTORM WINDS/FLASH FLOOD HIGH WIND 70
## [115] THUNDERSTORM WINDS 53 THUNDERSTORM WIND 59
## [117] THUNDERSTORM WIND 52 THUNDERSTORM WIND 69
## [119] LIGHTNING AND WINDS TSTM WIND G58
## [121] THUNDERSTORMW WINDS THUNDERSTORM WIND 60 MPH
## [123] THUNDERSTORM WIND 65MPH THUNDERSTORM WIND/ TREES
## [125] THUNDERSTORM WIND/AWNING THUNDERSTORM WIND 98 MPH
## [127] THUNDERSTORM WIND TREES THUNDERSTORM WIND 59 MPH
## [129] THUNDERSTORM WINDS 63 MPH THUNDERSTORM WIND/ TREE
## [131] THUNDERSTORM WIND 65 MPH THUNDERSTORM WIND.
## [133] THUNDERSTORM WIND 59 MPH. THUNDERSTORM WINDSHAIL
## [135] THUDERSTORM WINDS STORM FORCE WINDS
## [137] THUNDERSTORM WINDS AND TSTM WIND DAMAGE
## [139] THUNDERSTORM WINDS 50 THUNDERSTORM WIND G52
## [141] THUNDERSTORM WINDS 52 THUNDERSTORM WIND G51
## [143] THUNDERSTORM WIND G61 THUNDERESTORM WINDS
## [145] THUNDERSTORM WINDS/FLOODING THUNDEERSTORM WINDS
## [147] THUNDERSTORM WIND 50 THUNERSTORM WINDS
## [149] HIGH WINDS/COLD COLD/WINDS
## [151] THUNDERSTORM WIND 56 ICE/STRONG WINDS
## [153] EXTREME WIND CHILL/BLOWING SNO SNOW/HIGH WINDS
## [155] HIGH WINDS/SNOW HEAVY SNOW AND STRONG WINDS
## [157] BLOWING SNOW/EXTREME WIND CHIL THUNDERSTORM WIND/HAIL
## [159] TSTM WINDS TSTM WIND 65)
## [161] THUNDERSTORM WINDS/ FLOOD HIGH WIND AND SEAS
## [163] THUNDERSTORMWINDS THUNDERSTROM WIND
## [165] HIGH WIND 48 EXTREME WINDCHILL
## [167] TSTM WIND/HAIL High Wind
## [169] Tstm Wind Wind
## [171] Wind Damage Strong Wind
## [173] Thunderstorm Wind Strong Winds
## [175] Strong winds Whirlwind
## [177] Gusty Wind Gradient wind
## [179] TSTM WIND (G45) Gusty Winds
## [181] GUSTY WIND TSTM WIND 40
## [183] TSTM WIND 45 TSTM WIND (41)
## [185] TSTM WIND (G40) TSTM WND
## [187] TSTM WIND STRONG WIND GUST
## [189] Gusty winds GRADIENT WIND
## [191] Flood/Strong Wind TSTM WIND AND LIGHTNING
## [193] gradient wind Heavy surf and wind
## [195] TSTM WIND (G45) TSTM WIND (G45)
## [197] HIGH WIND (G40) TSTM WIND (G35)
## [199] WAKE LOW WIND COLD WIND CHILL TEMPERATURES
## [201] BITTER WIND CHILL BITTER WIND CHILL TEMPERATURES
## [203] WIND ADVISORY GUSTY WIND/HAIL
## [205] EXTREME WINDCHILL TEMPERATURES WIND AND WAVE
## [207] WIND TSTM WIND G45
## [209] NON-SEVERE WIND DAMAGE THUNDERSTORM WIND (G40)
## [211] WIND GUSTS GUSTY LAKE WIND
## [213] WND NON-TSTM WIND
## [215] NON TSTM WIND GUSTY THUNDERSTORM WINDS
## [217] MARINE TSTM WIND WHIRLWIND
## [219] EXTREME COLD/WIND CHILL GUSTY THUNDERSTORM WIND
## [221] COLD/WIND CHILL MARINE HIGH WIND
## [223] MARINE THUNDERSTORM WIND MARINE STRONG WIND
## 224 Levels: TSTM WIND TSTM WIND (G45) WIND ... WND
unique(droplevels(stormdata[stormdata$catType == "HAIL",]$EVTYPE))
## [1] HAIL HAIL 1.75) HAIL STORM
## [4] HAIL 75 SMALL HAIL HAIL 80
## [7] FUNNEL CLOUD/HAIL HAIL 0.75 HAIL 1.00
## [10] HAIL 1.75 HAIL 225 HAIL 0.88
## [13] DEEP HAIL HAIL 88 HAIL 175
## [16] HAIL 100 HAIL 150 HAIL 075
## [19] HAIL 125 HAIL 200 HAIL DAMAGE
## [22] THUNDERSTORM HAIL HAIL 088 HAIL/ICY ROADS
## [25] HAIL ALOFT HAIL 275 HAIL 450
## [28] HAILSTORM HAILSTORMS small hail
## [31] Hail(0.75) Small Hail LATE SEASON HAIL
## [34] NON SEVERE HAIL MARINE HAIL
## 35 Levels: DEEP HAIL FUNNEL CLOUD/HAIL HAIL HAIL 0.75 ... THUNDERSTORM HAIL
unique(droplevels(stormdata[stormdata$catType == "FLOOD",]$EVTYPE))
## [1] ICE STORM/FLASH FLOOD FLASH FLOOD
## [3] FLASH FLOODING FLOODING
## [5] FLOOD FLASH FLOODING/THUNDERSTORM WI
## [7] BREAKUP FLOODING RIVER FLOOD
## [9] COASTAL FLOOD FLOOD WATCH/
## [11] FLASH FLOODS HEAVY SURF COASTAL FLOODING
## [13] URBAN FLOODING URBAN/SMALL FLOODING
## [15] LOCAL FLOOD FLOOD/FLASH FLOOD
## [17] URBAN/SMALL STREAM FLOODING STREAM FLOODING
## [19] FLASH FLOOD/ SMALL STREAM URBAN FLOOD
## [21] URBAN FLOOD COASTAL FLOODING
## [23] URBAN/SMALL STREAM FLOOD MINOR FLOODING
## [25] URBAN/SMALL STREAM FLOOD URBAN AND SMALL STREAM FLOOD
## [27] SMALL STREAM FLOODING FLOODS
## [29] SMALL STREAM AND URBAN FLOODIN SMALL STREAM/URBAN FLOOD
## [31] SMALL STREAM AND URBAN FLOOD RURAL FLOOD
## [33] MAJOR FLOOD ICE JAM FLOODING
## [35] STREET FLOOD SMALL STREAM FLOOD
## [37] LAKE FLOOD URBAN AND SMALL STREAM FLOODIN
## [39] RIVER AND STREAM FLOOD MINOR FLOOD
## [41] RIVER FLOODING FLOOD/RIVER FLOOD
## [43] MUD SLIDES URBAN FLOODING HAIL FLOODING
## [45] LOCAL FLASH FLOOD FLOOD/FLASH FLOODING
## [47] COASTAL/TIDAL FLOOD FLASH FLOOD/FLOOD
## [49] FLASH FLOOD FROM ICE JAMS FLASH FLOOD/ STREET
## [51] FLOOD FLASH FLOOD FLOOD/FLASH
## [53] TIDAL FLOOD FLOOD/FLASH
## [55] HIGHWAY FLOODING FLASH FLOOD/ FLOOD
## [57] BEACH EROSION/COASTAL FLOOD RAPIDLY RISING WATER
## [59] SNOWMELT FLOODING FLASH FLOODING/FLOOD
## [61] FLASH FLOOODING BEACH FLOOD
## [63] FLOOD/FLASHFLOOD URBAN SMALL STREAM FLOOD
## [65] URBAN FLOOD LANDSLIDE URBAN FLOODS
## [67] FLASH FLOOD/LANDSLIDE LANDSLIDE/URBAN FLOOD
## [69] FLASH FLOOD LANDSLIDES URBAN/SML STREAM FLD
## [71] Minor Flooding Ice jam flood (minor
## [73] Coastal Flooding COASTALFLOOD
## [75] Erosion/Cstl Flood Tidal Flooding
## [77] River Flooding Flood/Flash Flood
## [79] STREET FLOODING Flood
## [81] TIDAL FLOODING COASTAL FLOOD
## [83] Urban Flooding Urban flood
## [85] Urban Flood Coastal Flood
## [87] coastal flooding Sml Stream Fld
## [89] URBAN/SML STREAM FLDG URBAN/SMALL STRM FLDG
## [91] COASTAL FLOODING/EROSION URBAN/STREET FLOODING
## [93] COASTAL FLOODING/EROSION FLOOD/FLASH/FLOOD
## [95] FLASH FLOOD CSTL FLOODING/EROSION
## [97] LAKESHORE FLOOD
## 97 Levels: COASTAL FLOOD FLASH FLOOD ... URBAN/STREET FLOODING
As you can see, there were some decisions about the order of the pattern matching, that would classify THUNDERSTORM WIND as WIND and not THUNDERSTORM. But with just a handful of cases, these appear to be clean categorizations.
The impacts we want to study are to the population health and economic consequences. The NOAA dataset provides the following data attributes
Details of how these estimates are calculated are covered in the Data Document, along with guidelines for what injuries/fatalies should be attributed to the storm event as opposed to some incidental event (e.g. a traffic accident during a flood evacuation). There is also an appendix that provides guidelines for estimating the dollar amount of damage.
The financial attributes will need to be combined in order to get an estimated dollar amount
summary(as.factor(toupper(stormdata$CROPDMGEXP)))
## ? 0 2 B K M
## 618413 7 19 1 9 281853 1995
summary(as.factor(toupper(stormdata$PROPDMGEXP)))
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B H K M
## 4 5 1 40 7 424665 11337
Most of it is either blank (we will have to ensure the corresponding damage estimates are blank are appropriately scaled > $1000); one of the known (B) billon, (M) million, or (K) thousand units. Lets go ahead and calculate a normalized damage impact for the 3 known magnitudes.
stormdata$CROPDMGEXP <- toupper(stormdata$CROPDMGEXP)
stormdata$PROPDMGEXP <- toupper(stormdata$PROPDMGEXP)
# make sure everything is in line
summary(stormdata[stormdata$PROPDMGEXP == "K",]$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.90 25.28 10.00 5000.00
summary(stormdata[stormdata$PROPDMGEXP == "M",]$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.50 2.50 12.41 6.00 929.00
summary(stormdata[stormdata$PROPDMGEXP == "B",]$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 1.450 2.300 6.896 5.000 115.000
summary(stormdata[stormdata$CROPDMGEXP == "K",]$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.766 0.000 990.000
summary(stormdata[stormdata$CROPDMGEXP == "M",]$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 3.00 17.12 9.00 596.00
summary(stormdata[stormdata$CROPDMGEXP == "B",]$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.200 0.500 1.512 1.510 5.000
# the "K" in PROPDMG had an outlier over 1000
stormdata[stormdata$PROPDMGEXP == "K" & stormdata$PROPDMG > 999,c("PROPDMG", "EVTYPE")]
## PROPDMG EVTYPE
## 749050 3000 HIGH WIND
## 750967 4410 TORNADO
## 755020 3000 FLOOD
## 762429 3500 THUNDERSTORM WIND
## 778568 5000 THUNDERSTORM WIND
## 789445 1000 COASTAL FLOOD
## 791403 4800 LANDSLIDE
## 808182 5000 FLASH FLOOD
## 808183 5000 FLASH FLOOD
## 811973 1000 LIGHTNING
## 815064 3200 THUNDERSTORM WIND
## 815187 1000 TORNADO
## 821163 1000 FLASH FLOOD
## 823219 1000 FLASH FLOOD
## 823303 1000 FLASH FLOOD
## 844264 1000 STRONG WIND
## 848518 1000 WILDFIRE
## 848601 1584 LANDSLIDE
## 878681 1000 FLOOD
## 881000 3000 FLASH FLOOD
## 900685 5000 WATERSPOUT
## 900997 1000 TORNADO
Taking a look, those seem like they were trying to estimate damage appropriately, for instance the record 749050 with 3000 in Property damage is reasonable to assume that is $3 million.
stormdata$propDamage <- ifelse(stormdata$PROPDMGEXP == "K", 1000 * stormdata$PROPDMG, NA)
stormdata$propDamage <- ifelse(stormdata$PROPDMGEXP == "M", 1000000 * stormdata$PROPDMG, stormdata$propDamage)
stormdata$propDamage <- ifelse(stormdata$PROPDMGEXP == "B", 1000000000 * stormdata$PROPDMG, stormdata$propDamage)
stormdata$cropDamage <- ifelse(stormdata$CROPDMGEXP == "K", 1000 * stormdata$CROPDMG, NA)
stormdata$cropDamage <- ifelse(stormdata$CROPDMGEXP == "M", 1000000 * stormdata$CROPDMG, stormdata$cropDamage)
stormdata$cropDamage <- ifelse(stormdata$CROPDMGEXP == "B", 1000000000 * stormdata$CROPDMG, stormdata$cropDamage)
An approximation is made with the remaining damage estimates
# indices where we have a positive value in the raw data but no tidy estimate
noTidyCropDmg <- !is.na(stormdata$CROPDMG) & is.na(stormdata$cropDamage) & (stormdata$CROPDMG > 0)
# average crop damage by type for the missing estimates
group_by(stormdata, catType) %>%
summarize(meanCROPDMG=mean(CROPDMG, na.rm=TRUE),
meanCropDamage=mean(cropDamage, na.rm=TRUE),
count=n(),
countWithDamage=sum(!is.na(cropDamage))) %>%
filter(!is.nan(meanCropDamage)) %>%
subset(catType %in% unique(stormdata[noTidyCropDmg,]$catType))
## # A tibble: 3 x 5
## catType meanCROPDMG meanCropDamage count countWithDamage
## <fctr> <dbl> <dbl> <int> <int>
## 1 WIND 0.613399 18532.22 364880 110429
## 2 HAIL 2.010082 36919.44 289276 82528
## 3 TORNADO 1.648294 43238.65 60685 9597
# since crop damage is all in thousands use that factor
stormdata$cropDamage <- ifelse(is.na(stormdata$cropDamage) & (stormdata$CROPDMG > 0), 1000 * stormdata$CROPDMG, stormdata$cropDamage)
# and check the means again
group_by(stormdata, catType) %>%
summarize(meanCROPDMG=mean(CROPDMG, na.rm=TRUE),
meanCropDamage=mean(cropDamage, na.rm=TRUE),
count=n(),
countWithDamage=sum(!is.na(cropDamage))) %>%
filter(!is.nan(meanCropDamage)) %>%
subset(catType %in% unique(stormdata[noTidyCropDmg,]$catType))
## # A tibble: 3 x 5
## catType meanCROPDMG meanCropDamage count countWithDamage
## <fctr> <dbl> <dbl> <int> <int>
## 1 WIND 0.613399 18531.34 364880 110439
## 2 HAIL 2.010082 36918.82 289276 82530
## 3 TORNADO 1.648294 43241.80 60685 9600
# Look at property damage
noTidyPropDmg <- !is.na(stormdata$PROPDMG) & is.na(stormdata$propDamage) & (stormdata$PROPDMG > 0)
# and average prop damage by type for the missing estimates
group_by(stormdata, catType) %>%
summarize(meanPROPDMG=mean(PROPDMG, na.rm=TRUE),
meanPropDamage=mean(propDamage, na.rm=TRUE),
count=n(),
countWithDamage=sum(!is.na(propDamage))) %>%
filter(!is.nan(meanPropDamage)) %>%
subset(catType %in% unique(stormdata[noTidyPropDmg,]$catType))
## # A tibble: 9 x 5
## catType meanPROPDMG meanPropDamage count countWithDamage
## <fctr> <dbl> <dbl> <int> <int>
## 1 WIND 8.590876 97062.24 364880 182719
## 2 HAIL 2.382900 173271.23 289276 92191
## 3 FLOOD 28.582123 3176728.08 86091 52748
## 4 TORNADO 52.970789 1098395.71 60685 51841
## 5 SNOW 8.087139 347295.32 37327 22266
## 6 LIGHTNING 38.278994 84650.38 15763 10971
## 7 RAIN 4.973675 548224.96 12463 5980
## 8 HEAT 2.395225 71554.78 4109 1821
## 9 ICE 35.018146 2898129.95 2131 1366
# flood, hail, ice, tornado all in the millions
stormdata$propDamage <- ifelse(is.na(stormdata$propDamage) &
(stormdata$catType %in% c("FLOOD","HAIL","ICE","TORNADO")) &
(stormdata$PROPDMG > 0),
1000000 * stormdata$PROPDMG, stormdata$propDamage)
# everything else at 1K
stormdata$propDamage <- ifelse(is.na(stormdata$propDamage) &
(stormdata$PROPDMG > 0),
1000 * stormdata$PROPDMG, stormdata$propDamage)
# confirm we cleared it all up
# Look at property damage
sum(!is.na(stormdata$PROPDMG) & is.na(stormdata$propDamage) & (stormdata$PROPDMG > 0))
## [1] 0
# and recheck the means
group_by(stormdata, catType) %>%
summarize(meanPROPDMG=mean(PROPDMG, na.rm=TRUE),
meanPropDamage=mean(propDamage, na.rm=TRUE),
count=n(),
countWithDamage=sum(!is.na(propDamage))) %>%
filter(!is.nan(meanPropDamage)) %>%
subset(catType %in% unique(stormdata[noTidyPropDmg,]$catType))
## # A tibble: 9 x 5
## catType meanPROPDMG meanPropDamage count countWithDamage
## <fctr> <dbl> <dbl> <int> <int>
## 1 WIND 8.590876 96977.78 364880 182946
## 2 HAIL 2.382900 176787.45 289276 92211
## 3 FLOOD 28.582123 3185621.93 86091 52792
## 4 TORNADO 52.970789 1103902.03 60685 51853
## 5 SNOW 8.087139 347249.10 37327 22269
## 6 LIGHTNING 38.278994 84526.89 15763 10989
## 7 RAIN 4.973675 548133.64 12463 5981
## 8 HEAT 2.395225 71597.83 4109 1822
## 9 ICE 35.018146 2932586.33 2131 1367
# and set up the data frames to drive the analysis
# combines them all all once, and add totals for casulaties and damage
resultN <- stormdata %>% group_by(catType) %>%
summarize(withInjury=sum(INJURIES > 0, na.rm=TRUE),
withFatality=sum(FATALITIES >0, na.rm=TRUE),
withPropDamage=sum(propDamage > 0, na.rm=TRUE),
withCropDamage=sum(cropDamage > 0, na.rm=TRUE),
sumInjury=sum(INJURIES, na.rm=TRUE),
sumFatality=sum(FATALITIES, na.rm=TRUE),
propDamage=sum(propDamage, na.rm=TRUE),
cropDamage=sum(cropDamage, na.rm=TRUE),
count=n()) %>%
mutate(casualties = sumInjury + sumFatality) %>%
mutate(damage = propDamage + cropDamage) %>%
mutate(perInjury = withInjury/count) %>%
mutate(perFatality = withFatality/count) %>%
mutate(perPropDamage = withPropDamage/count) %>%
mutate(perCropDamage = withCropDamage/count)
# top10 results by casualites and damage (and combined top impacts)
result10cas <- head(resultN[order(resultN$casualties,decreasing=TRUE),], 10)
result10dmg <- head(resultN[order(resultN$damage,decreasing=TRUE),], 10)
combinedType <- intersect(result10cas$catType, result10dmg$catType)
result10 <- resultN[resultN$catType %in% combinedType,]
This chart outlines the top events impacts to total casualities (fatalities and injuries) and total estimated damage. As you can see the FLOOD events by far cause the most damage, and the TORNADO events by far cause the most impact to human health.
ggplot(result10, aes(x=damage, y=casualties)) +
geom_point(aes(col=catType, size=sumFatality)) +
labs(title="Damage Vs Casualties",
x="Damage",
y="Casualties",
color="Storm Type",
size="Fatalities") +
scale_x_continuous(labels=scales::unit_format("$B", 1e-9)) +
guides(color=guide_legend(override.aes=(list(size=4)))) +
theme_bw()
These dataframes will be used for more detailed exploration on the impacts
annual <- stormdata %>%
mutate(year = as.Date(format(date,"%Y/01/01"))) %>%
group_by(year, catType) %>%
summarize(propDamage=sum(propDamage, na.rm=TRUE), cropDamage=sum(cropDamage, na.rm=TRUE),
INJURIES=sum(INJURIES, na.rm=TRUE), FATALITIES=sum(FATALITIES, na.rm=TRUE)) %>%
mutate(damage=propDamage + cropDamage, casualties = INJURIES + FATALITIES) %>%
mutate(isTornado=(catType == "TORNADO"))
annualTotal <- group_by(annual, year) %>%
summarize(propDamage=sum(propDamage, na.rm=TRUE), cropDamage=sum(cropDamage, na.rm=TRUE),
INJURIES=sum(INJURIES, na.rm=TRUE), FATALITIES=sum(FATALITIES, na.rm=TRUE),
damage=sum(damage), casualties=sum(casualties))
annualTotal2 <- group_by(annual, year, isTornado) %>%
summarize(propDamage=sum(propDamage, na.rm=TRUE),
cropDamage=sum(cropDamage, na.rm=TRUE),
INJURIES=sum(INJURIES, na.rm=TRUE),
FATALITIES=sum(FATALITIES, na.rm=TRUE),
damage=sum(damage),
casualties=sum(casualties))
summary(annualTotal)
## year propDamage cropDamage
## Min. :1950-01-01 Min. :3.448e+07 Min. :0.000e+00
## 1st Qu.:1965-04-02 1st Qu.:2.041e+08 1st Qu.:0.000e+00
## Median :1980-07-02 Median :1.080e+09 Median :0.000e+00
## Mean :1980-07-02 Mean :6.913e+09 Mean :7.920e+08
## 3rd Qu.:1995-10-01 3rd Qu.:5.523e+09 3rd Qu.:1.207e+09
## Max. :2011-01-01 Max. :1.219e+11 Max. :5.807e+09
## INJURIES FATALITIES damage casualties
## Min. : 524 Min. : 24.00 Min. :3.448e+07 Min. : 558
## 1st Qu.: 1044 1st Qu.: 55.75 1st Qu.:2.041e+08 1st Qu.: 1098
## Median : 1790 Median : 96.50 Median :1.080e+09 Median : 1864
## Mean : 2267 Mean : 244.27 Mean :7.705e+09 Mean : 2511
## 3rd Qu.: 2722 3rd Qu.: 408.25 3rd Qu.:7.852e+09 3rd Qu.: 3167
## Max. :11177 Max. :1491.00 Max. :1.255e+11 Max. :11864
The Casualty compared to Damage graphic shows that Tornados and Floods are the biggest impacts to the human population.
This shows how likely an event will result in injury an is in the most frequent event categories and the magnitude of injuries
ggplot(result10, aes(catType)) +
geom_point(aes(y = perInjury, size=sumInjury, col="Injury")) +
geom_point(aes(y = perFatality, size=sumFatality, col="Fatality")) +
labs(title="Liklihood and Magnitude of Injuries",
x="Event Type",
y="Liklihood (%)",
size="Casualities",
col="Type of Impact") +
coord_flip() +
theme_bw()
This shows us that over 10% of Tornado result in some injuries, in fact these are the types of events that cause the most injuries.
Using a similar approach to the human impact we can see the liklihood of property damage and the magnitude of the damage based on the most frequently occuring events.
ggplot(result10, aes(catType)) +
geom_point(aes(y = perCropDamage, size=cropDamage, col="Crop")) +
geom_point(aes(y = perPropDamage, size=propDamage, col="Property")) +
labs(title="Liklihood and Magnitude of Crop Damage",
x="Event Type",
y="Liklihood (%)",
size="Damage ($)",
col="Type") +
scale_size_continuous(labels=comma) +
scale_y_continuous(labels=percent) +
coord_flip() +
theme_bw()
Across the board we can see that Hurricane cause the events that are expected to most likley cause damage, however floods cause the most in terms of dollar costs to both property and crops.
The results of this analysis on the data set show that Tornado events are the cause of most injuries, in addition if an individual is in a Tornado, it is more likely to be injured than in the other commonly occuring events.
Flood events cause the most damage, however it is more likely a Hurricane or Tornado would result in property damage, when the Flood does occur the damage can be signciantly more.