This analysis looked at NOAA storm data since 1950 to find out what sorts of storms cause the most problems with human health and wealth. We found that tornadoes are by far the worst sorts of storms for human health, and heat is second. We did not finish our analysis on the economic impact.
From the Coursera course website we obtained data on storms and severe weather events that were sourced from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database.
We first read in the file from the original bz2 file, and then create the storm data frame.
storm <- read.csv(bzfile('repdata-data-StormData.csv.bz2', 'r'), header = TRUE)
Now we check the storm data frame and its structure to confirm we can work with the data as loaded.
dim(storm)
## [1] 902297 37
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We will only be using a small handful of the variables for this analysis – including EVTYPE (event type), FATALITIES, INJURIES, and the various damage-related variables.
Before moving into the analysis, let's take a look at which types of events are most reported. We'll take the top 20.
head(sort(table(storm$EVTYPE), decreasing = TRUE), 20)
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
So thunderstorm-related storms are by far the highest number of reports, between hail, wind, tornados, and flash floods. This doesn't tell us how damaging these storms are, just that they have the most reports.
As we were inspecting the data, we discovered that one major problem that we were seeing was somewhat inconsistent reporting for weather events. For example,
heat is alternatively described as 'heat', 'excessive heat', or 'extreme heat.' As another example, hail is often reported just as 'hail' but sometimes as hail with a size of the pellets. In order to ease our analysis, we will clean up the EVTYPE variable.
unique(grep('HEAT', storm$EVTYPE, value = TRUE))
## [1] "HEAT" "EXTREME HEAT"
## [3] "EXCESSIVE HEAT" "RECORD HEAT"
## [5] "HEAT WAVE" "DROUGHT/EXCESSIVE HEAT"
## [7] "RECORD HEAT WAVE" "RECORD/EXCESSIVE HEAT"
## [9] "HEAT WAVES" "HEAT WAVE DROUGHT"
## [11] "HEAT/DROUGHT" "HEAT DROUGHT"
## [13] "EXCESSIVE HEAT/DROUGHT"
So there are many ways to express heat, but we want to disinclude reasons of drought, because those aren't only related to heat.
heat <- unique(intersect(grep('HEAT', storm$EVTYPE),
grep('DROUGHT', storm$EVTYPE, invert = TRUE)))
storm$EVTYPE <- replace(storm$EVTYPE, heat, 'HEAT')
There are a few other variables that need fixing likewise, in particular those regarding tornadoes, hurricanes, snow, ice, flooding, wind, hail, and thunderstorms.
# Fix Tornadoes
unique(grep('TORN', storm$EVTYPE, value = TRUE))
## [1] "TORNADO" "TORNADO F0"
## [3] "TORNADOS" "WATERSPOUT/TORNADO"
## [5] "WATERSPOUT TORNADO" "WATERSPOUT-TORNADO"
## [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"
## [9] "WATERSPOUT/ TORNADO" "TORNADO F3"
## [11] "TORNDAO" "TORNADO F1"
## [13] "TORNADO/WATERSPOUT" "TORNADO F2"
## [15] "TORNADOES" "TORNADO DEBRIS"
tornadoes <- grep('TORN', storm$EVTYPE)
storm$EVTYPE <- replace(storm$EVTYPE, tornadoes, 'TORNADO')
# Fix hurricanes
unique(grep('HURR', storm$EVTYPE, value = TRUE))
## [1] "HURRICANE OPAL/HIGH WINDS" "HURRICANE ERIN"
## [3] "HURRICANE OPAL" "HURRICANE"
## [5] "HURRICANE-GENERATED SWELLS" "HURRICANE EMILY"
## [7] "HURRICANE GORDON" "HURRICANE FELIX"
## [9] "HURRICANE/TYPHOON"
hurricane <- grep('HURR', storm$EVTYPE)
storm$EVTYPE <- replace(storm$EVTYPE, hurricane, 'HURRICANE')
# Fix Ice Storm
unique(grep('ICE STORM', storm$EVTYPE, value = TRUE))
## [1] "ICE STORM/FLASH FLOOD" "ICE STORM"
## [3] "SLEET/ICE STORM" "SNOW AND ICE STORM"
## [5] "HEAVY SNOW/ICE STORM" "HEAVY SNOW AND ICE STORM"
## [7] "ICE STORM AND SNOW" "GLAZE/ICE STORM"
## [9] "SNOW/ICE STORM"
icestorm <- grep('ICE STORM', storm$EVTYPE)
storm$EVTYPE <- replace(storm$EVTYPE, icestorm, 'ICE STORM')
# Fix Ice
unique(grep('ICE', storm$EVTYPE, value = TRUE))
## [1] "ICE STORM" "SNOW/ICE"
## [3] "ICE" "GLAZE ICE"
## [5] "ICE JAM FLOODING" "ICE/SNOW"
## [7] "SNOW AND ICE" "ICE FLOES"
## [9] "HEAVY SNOW AND ICE" "HEAVY SNOW/ICE"
## [11] "ICE JAM" "FLASH FLOOD FROM ICE JAMS"
## [13] "ICE AND SNOW" "HEAVY SNOW & ICE"
## [15] "ICE/STRONG WINDS" "SNOW/ ICE"
## [17] "BLACK ICE" "ICE PELLETS"
## [19] "ICE ROADS" "FALLING SNOW/ICE"
## [21] "PATCHY ICE" "ICE ON ROAD"
ice <- unique(intersect(grep('ICE', storm$EVTYPE),
grep('ICE STORM', storm$EVTYPE, invert = TRUE)))
storm$EVTYPE <- replace(storm$EVTYPE, ice, 'ICE')
# Fix Snow
unique(grep('SNOW', storm$EVTYPE, value = TRUE))
## [1] "SNOW" "HEAVY SNOW"
## [3] "HIGH WIND AND HEAVY SNOW" "HEAVY SNOW/HIGH"
## [5] "HEAVY SNOW/HIGH WINDS/FREEZING" "HIGH WIND/HEAVY SNOW"
## [7] "RECORD SNOWFALL" "HEAVY SNOW/WIND"
## [9] "SNOW AND WIND" "HEAVY SNOWPACK"
## [11] "BLIZZARD/HEAVY SNOW" "HEAVY SNOW/HIGH WINDS"
## [13] "BLOWING SNOW" "LIGHT SNOW AND SLEET"
## [15] "FIRST SNOW" "SLEET/RAIN/SNOW"
## [17] "RAIN/SNOW" "SNOW/RAIN/SLEET"
## [19] "HEAVY SNOW/HIGH WIND" "FREEZING RAIN/SNOW"
## [21] "THUNDERSNOW" "HEAVY RAIN/SNOW"
## [23] "SNOW/SLEET/FREEZING RAIN" "EARLY SNOW"
## [25] "HEAVY SNOW/BLOWING SNOW" "BLOWING SNOW- EXTREME WIND CHI"
## [27] "SNOW AND HEAVY SNOW" "SNOW/HEAVY SNOW"
## [29] "SNOW- HIGH WIND- WIND CHILL" "HEAVY SNOW/BLIZZARD"
## [31] "SNOWSTORM" "HEAVY SNOW/SLEET"
## [33] "SNOW/RAIN" "SNOW SQUALLS"
## [35] "SNOW SQUALL" "HEAVY LAKE SNOW"
## [37] "HEAVY SNOW/FREEZING RAIN" "LAKE EFFECT SNOW"
## [39] "HEAVY WET SNOW" "BLIZZARD AND HEAVY SNOW"
## [41] "HEAVY SNOW ANDBLOWING SNOW" "BLOWING SNOW & EXTREME WIND CH"
## [43] "HEAVY SNOW/WINTER STORM" "HEAVY SNOW AND HIGH WINDS"
## [45] "HEAVY SNOW/HIGH WINDS & FLOOD" "WET SNOW"
## [47] "LIGHT SNOW" "RECORD SNOW"
## [49] "SNOW/COLD" "HEAVY SNOW SQUALLS"
## [51] "HEAVY SNOW/SQUALLS" "HEAVY SNOW-SQUALLS"
## [53] "SNOW FREEZING RAIN" "LACK OF SNOW"
## [55] "SNOW/SLEET" "SNOW/FREEZING RAIN"
## [57] "SNOW DROUGHT" "HEAVY SNOW FREEZING RAIN"
## [59] "FREEZING RAIN AND SNOW" "SNOW SHOWERS"
## [61] "HEAVY SNOW/BLIZZARD/AVALANCHE" "RECORD SNOW/COLD"
## [63] "SNOW/ BITTER COLD" "SNOW SLEET"
## [65] "SNOW/HIGH WINDS" "HIGH WINDS/SNOW"
## [67] "SNOWMELT FLOODING" "HEAVY SNOW AND STRONG WINDS"
## [69] "SNOW ACCUMULATION" "BLOWING SNOW/EXTREME WIND CHIL"
## [71] "SNOW/BLOWING SNOW" "NEAR RECORD SNOW"
## [73] "SLEET/SNOW" "SNOW/SLEET/RAIN"
## [75] "SNOW AND COLD" "PROLONG COLD/SNOW"
## [77] "SNOW\\COLD" "SNOWFALL RECORD"
## [79] "HEAVY SNOW AND" "LAKE-EFFECT SNOW"
## [81] "LATE SNOW" "COLD AND SNOW"
## [83] "MODERATE SNOW" "MODERATE SNOWFALL"
## [85] "SNOW AND SLEET" "LIGHT SNOW/FREEZING PRECIP"
## [87] "EARLY SNOWFALL" "EXCESSIVE SNOW"
## [89] "MONTHLY SNOWFALL" "LATE SEASON SNOW"
## [91] "SNOW ADVISORY" "UNUSUALLY LATE SNOW"
## [93] "ACCUMULATED SNOWFALL"
snow <- unique(grep('SNOW', storm$EVTYPE))
storm$EVTYPE <- replace(storm$EVTYPE, snow, 'SNOW')
# Fix Hail
unique(grep('HAIL', storm$EVTYPE, value = TRUE))
## [1] "HAIL" "THUNDERSTORM WINDS/HAIL"
## [3] "THUNDERSTORM WINDS HAIL" "HAIL 1.75)"
## [5] "HAIL STORM" "HAIL 75"
## [7] "SMALL HAIL" "HAIL 80"
## [9] "FUNNEL CLOUD/HAIL" "HAIL 0.75"
## [11] "HAIL 1.00" "HAIL/WINDS"
## [13] "HAIL/WIND" "HAIL 1.75"
## [15] "WIND/HAIL" "THUNDERSTORM WINDS/ HAIL"
## [17] "HAIL 225" "HAIL 0.88"
## [19] "DEEP HAIL" "HAIL 88"
## [21] "HAIL 175" "HAIL 100"
## [23] "HAIL 150" "HAIL 075"
## [25] "HAIL 125" "HAIL 200"
## [27] "HAIL FLOODING" "HAIL DAMAGE"
## [29] "THUNDERSTORM HAIL" "HAIL 088"
## [31] "THUNDERSTORM WINDSHAIL" "HAIL/ICY ROADS"
## [33] "HAIL ALOFT" "THUNDERSTORM WIND/HAIL"
## [35] "HAIL 275" "HAIL 450"
## [37] "HAILSTORM" "HAILSTORMS"
## [39] "TSTM WIND/HAIL" "GUSTY WIND/HAIL"
## [41] "LATE SEASON HAIL" "NON SEVERE HAIL"
## [43] "MARINE HAIL"
hail <- unique(grep('HAIL', storm$EVTYPE))
storm$EVTYPE <- replace(storm$EVTYPE, hail, 'HAIL')
# Fix Flash Flooding
unique(grep('FLASH FLOOD', storm$EVTYPE, value = TRUE))
## [1] "FLASH FLOOD" "FLASH FLOODING"
## [3] "FLASH FLOODING/THUNDERSTORM WI" "FLASH FLOODS"
## [5] "FLOOD/FLASH FLOOD" "FLASH FLOOD WINDS"
## [7] "FLASH FLOOD/" "THUNDERSTORM WINDS/FLASH FLOOD"
## [9] "LOCAL FLASH FLOOD" "FLOOD/FLASH FLOODING"
## [11] "FLASH FLOOD/FLOOD" "FLASH FLOOD - HEAVY RAIN"
## [13] "FLASH FLOOD/ STREET" "FLASH FLOOD/HEAVY RAIN"
## [15] "FLASH FLOOD/ FLOOD" "FLASH FLOODING/FLOOD"
## [17] "FLASH FLOOD/LANDSLIDE" "FLASH FLOOD LANDSLIDES"
## [19] " FLASH FLOOD"
ff <- unique(grep('FLASH FLOOD', storm$EVTYPE))
storm$EVTYPE <- replace(storm$EVTYPE, ff, 'FLASH FLOOD')
# Combine all other types of flooding into one label "FLOOD"
unique(grep('FLOOD', storm$EVTYPE, value = TRUE))
## [1] "FLASH FLOOD" "FLOODING"
## [3] "FLOOD" "BREAKUP FLOODING"
## [5] "RIVER FLOOD" "COASTAL FLOOD"
## [7] "FLOOD WATCH/" "FLOODING/HEAVY RAIN"
## [9] "HEAVY SURF COASTAL FLOODING" "URBAN FLOODING"
## [11] "URBAN/SMALL FLOODING" "LOCAL FLOOD"
## [13] "FLOOD/RAIN/WINDS" "URBAN/SMALL STREAM FLOODING"
## [15] "STREAM FLOODING" "FLOOD/RAIN/WIND"
## [17] "SMALL STREAM URBAN FLOOD" "URBAN FLOOD"
## [19] "HEAVY RAIN/FLOODING" "COASTAL FLOODING"
## [21] "HIGH WINDS/FLOODING" "URBAN/SMALL STREAM FLOOD"
## [23] "MINOR FLOODING" "URBAN/SMALL STREAM FLOOD"
## [25] "URBAN AND SMALL STREAM FLOOD" "SMALL STREAM FLOODING"
## [27] "FLOODS" "SMALL STREAM AND URBAN FLOODIN"
## [29] "SMALL STREAM/URBAN FLOOD" "SMALL STREAM AND URBAN FLOOD"
## [31] "RURAL FLOOD" "THUNDERSTORM WINDS URBAN FLOOD"
## [33] "MAJOR FLOOD" "STREET FLOOD"
## [35] "SMALL STREAM FLOOD" "LAKE FLOOD"
## [37] "URBAN AND SMALL STREAM FLOODIN" "RIVER AND STREAM FLOOD"
## [39] "MINOR FLOOD" "HIGH WINDS/COASTAL FLOOD"
## [41] "RIVER FLOODING" "FLOOD/RIVER FLOOD"
## [43] "MUD SLIDES URBAN FLOODING" "HEAVY RAIN AND FLOOD"
## [45] "COASTAL/TIDAL FLOOD" "HEAVY RAIN; URBAN FLOOD WINDS;"
## [47] "FLOOD FLASH" "FLOOD FLOOD/FLASH"
## [49] "TIDAL FLOOD" "FLOOD/FLASH"
## [51] "HEAVY RAINS/FLOODING" "THUNDERSTORM WINDS/FLOODING"
## [53] "HIGHWAY FLOODING" "HEAVY RAIN/MUDSLIDES/FLOOD"
## [55] "BEACH EROSION/COASTAL FLOOD" "BEACH FLOOD"
## [57] "THUNDERSTORM WINDS/ FLOOD" "FLOOD & HEAVY RAIN"
## [59] "FLOOD/FLASHFLOOD" "URBAN SMALL STREAM FLOOD"
## [61] "URBAN FLOOD LANDSLIDE" "URBAN FLOODS"
## [63] "HEAVY RAIN/URBAN FLOOD" "LANDSLIDE/URBAN FLOOD"
## [65] "COASTALFLOOD" "STREET FLOODING"
## [67] "TIDAL FLOODING" " COASTAL FLOOD"
## [69] "COASTAL FLOODING/EROSION" "URBAN/STREET FLOODING"
## [71] "COASTAL FLOODING/EROSION" "FLOOD/FLASH/FLOOD"
## [73] "CSTL FLOODING/EROSION" "LAKESHORE FLOOD"
unique(grep('FLD', storm$EVTYPE, value = TRUE))
## [1] "URBAN/SML STREAM FLD" "URBAN/SML STREAM FLDG" "URBAN/SMALL STRM FLDG"
flood <- unique(intersect(union(grep('FLOOD', storm$EVTYPE),
grep('FLD', storm$EVTYPE)),
grep('FLASH FLOOD', storm$EVTYPE, invert = TRUE)))
storm$EVTYPE <- replace(storm$EVTYPE, flood, 'FLOOD')
# Fix Thunderstorm Wind
unique(grep('TSTM WIND', storm$EVTYPE, value = TRUE))
## [1] "TSTM WIND" "TSTM WIND 51"
## [3] "TSTM WIND 50" "TSTM WIND 52"
## [5] "TSTM WIND 55" "TSTM WIND G58"
## [7] "TSTM WIND DAMAGE" "TSTM WINDS"
## [9] "TSTM WIND 65)" "TSTM WIND (G45)"
## [11] "TSTM WIND 40" "TSTM WIND 45"
## [13] "TSTM WIND (41)" "TSTM WIND (G40)"
## [15] " TSTM WIND" "TSTM WIND AND LIGHTNING"
## [17] " TSTM WIND (G45)" "TSTM WIND (G45)"
## [19] "TSTM WIND (G35)" "TSTM WIND G45"
## [21] "NON-TSTM WIND" "NON TSTM WIND"
## [23] "MARINE TSTM WIND"
unique(grep('THUNDERSTORM WIND', storm$EVTYPE, value = TRUE))
## [1] "THUNDERSTORM WINDS" "THUNDERSTORM WIND"
## [3] "THUNDERSTORM WINDS LIGHTNING" "THUNDERSTORM WINDS/FUNNEL CLOU"
## [5] "SEVERE THUNDERSTORM WINDS" "LIGHTNING THUNDERSTORM WINDSS"
## [7] "THUNDERSTORM WINDS 60" "THUNDERSTORM WINDSS"
## [9] "LIGHTNING THUNDERSTORM WINDS" "THUNDERSTORM WINDS53"
## [11] "THUNDERSTORM WINDS 13" "THUNDERSTORM WINDS SMALL STREA"
## [13] "THUNDERSTORM WINDS 2" "THUNDERSTORM WINDS 61"
## [15] "THUNDERSTORM WIND/LIGHTNING" "THUNDERSTORM WIND G50"
## [17] "THUNDERSTORM WINDS/HEAVY RAIN" "THUNDERSTORM WINDS LE CEN"
## [19] "THUNDERSTORM WINDS G" "THUNDERSTORM WIND G60"
## [21] "THUNDERSTORM WINDS." "THUNDERSTORM WIND G55"
## [23] "THUNDERSTORM WINDS G60" "THUNDERSTORM WINDS FUNNEL CLOU"
## [25] "THUNDERSTORM WINDS 62" "THUNDERSTORM WINDS 53"
## [27] "THUNDERSTORM WIND 59" "THUNDERSTORM WIND 52"
## [29] "THUNDERSTORM WIND 69" "THUNDERSTORM WIND 60 MPH"
## [31] "THUNDERSTORM WIND 65MPH" "THUNDERSTORM WIND/ TREES"
## [33] "THUNDERSTORM WIND/AWNING" "THUNDERSTORM WIND 98 MPH"
## [35] "THUNDERSTORM WIND TREES" "THUNDERSTORM WIND 59 MPH"
## [37] "THUNDERSTORM WINDS 63 MPH" "THUNDERSTORM WIND/ TREE"
## [39] "THUNDERSTORM WIND 65 MPH" "THUNDERSTORM WIND."
## [41] "THUNDERSTORM WIND 59 MPH." "THUNDERSTORM WINDS AND"
## [43] "THUNDERSTORM WINDS 50" "THUNDERSTORM WIND G52"
## [45] "THUNDERSTORM WINDS 52" "THUNDERSTORM WIND G51"
## [47] "THUNDERSTORM WIND G61" "THUNDERSTORM WIND 50"
## [49] "THUNDERSTORM WIND 56" "THUNDERSTORM WINDS HEAVY RAIN"
## [51] "THUNDERSTORM WIND (G40)" "GUSTY THUNDERSTORM WINDS"
## [53] "GUSTY THUNDERSTORM WIND" "MARINE THUNDERSTORM WIND"
tstorm <- unique(intersect(union(grep('TSTM WIND', storm$EVTYPE),
grep('THUNDERSTORM WIND', storm$EVTYPE)),
grep('NON', storm$EVTYPE, invert = TRUE)))
storm$EVTYPE <- replace(storm$EVTYPE, tstorm, 'TSTM WIND')
head(sort(table(storm$EVTYPE), decreasing = TRUE), 20)
##
## TSTM WIND HAIL TORNADO FLASH FLOOD FLOOD
## 335548 290393 60701 55661 30380
## HIGH WIND SNOW LIGHTNING HEAVY RAIN WINTER STORM
## 20212 17539 15754 11723 11433
## WINTER WEATHER FUNNEL CLOUD WATERSPOUT STRONG WIND WILDFIRE
## 7026 6839 3796 3566 2761
## BLIZZARD HEAT DROUGHT ICE STORM HIGH WINDS
## 2719 2628 2488 2032 1533
Based on this final table, most of the highest-occurring storm types have been cleaned up, and this should be sufficient to answer our questions. We could achieve more fidelity with our analysis if we combined the various forms of snow/winter weather/winter storm, etc; but sometimes these various terms mean slightly different things so we are going to leave them as-is.
When looking at economic impact of the storms, we are looking at Property Damage and Crop Damage. The data are presented as a number in one column and a multiplier in another. The multiplier columns look like what's shown below.
table(storm$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(storm$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
It is clear what is mean by the exp's such as 'K', 'M', and 'B', but it is not clear what is meant by exp's such as 'H', '0', '2'. Since those unclear columns are such a small part of the overall total we'll basically ignore them.
storm$PROPDMGEXP.number <- as.character(storm$PROPDMGEXP)
proptozero <- grep('\\?|\\+|\\-|0|1|2|3|4|5|6|7|8|h|H', storm$PROPDMGEXP.number)
proptok <- grep('k|K', storm$PROPDMGEXP.number)
proptom <- grep('m|M', storm$PROPDMGEXP.number)
proptob <- grep('B', storm$PROPDMGEXP.number)
storm$PROPDMGEXP.number <- replace(storm$PROPDMGEXP.number, proptozero, 0)
storm$PROPDMGEXP.number <- replace(storm$PROPDMGEXP.number, proptok, 1000)
storm$PROPDMGEXP.number <- replace(storm$PROPDMGEXP.number, proptom, 1000000)
storm$PROPDMGEXP.number <- replace(storm$PROPDMGEXP.number, proptob, 1000000000)
storm$PROPDMGEXP.number <- as.numeric(storm$PROPDMGEXP.number)
# Fix CROPDMGEXP
storm$CROPDMGEXP.number <- as.character(storm$CROPDMGEXP)
croptozero <- grep('\\?|0|2', storm$CROPDMGEXP.number)
croptok <- grep('k|K', storm$CROPDMGEXP.number)
croptom <- grep('m|M', storm$CROPDMGEXP.number)
croptob <- grep('B', storm$CROPDMGEXP.number)
storm$CROPDMGEXP.number <- replace(storm$CROPDMGEXP.number, croptozero, 0)
storm$CROPDMGEXP.number <- replace(storm$CROPDMGEXP.number, croptok, 1000)
storm$CROPDMGEXP.number <- replace(storm$CROPDMGEXP.number, croptom, 1000000)
storm$CROPDMGEXP.number <- replace(storm$CROPDMGEXP.number, croptob, 1000000000)
storm$CROPDMGEXP.number <- as.numeric(storm$CROPDMGEXP.number)
Now we create two extra columns with total Property Damage and total Crop Damage and a final column with total economic impact.
storm$TOTALPROPDMG <- storm$PROPDMG * storm$PROPDMGEXP.number
storm$TOTALCROPDMG <- storm$CROPDMG * storm$CROPDMGEXP.number
storm$ECONIMPACT <- storm$TOTALPROPDMG + storm$TOTALCROPDMG
summary(storm$ECONIMPACT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00e+00 0.00e+00 0.00e+00 9.35e+05 4.00e+03 1.15e+11 622731
We have two main questions that we are trying to answer in this analysis. First, which types of storms have the most effect on human health? Secondly, what sorts of storms have the most economic impact?
Let's first look a summary for how the storms affect human health overall. The two health-related variables are INJURIES and FATALITIES.
# Display a summary of INJURY data from storm data
summary(storm$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.2 0.0 1700.0
# Display a summary of FATALITIES data from storm data
summary(storm$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 583
So far and away most storms have little documented effect on human health. It's interesting that in both cases the mean is above the 3rd quartile (which is 0 in both cases), so that means most weather events do not hurt or kill people, but that a few have a very large effect.
We'll table these two variables and look at the top 50 or so to see what the upper end of the injuries and fatalities spectrum looks like.
head(sort(storm$INJURIES, decreasing = TRUE), 50)
## [1] 1700 1568 1228 1150 1150 800 800 785 780 750 700 600 597 560
## [15] 550 519 504 500 500 500 500 500 500 500 463 450 450 450
## [29] 437 411 410 397 385 350 350 350 350 342 325 316 306 300
## [43] 300 300 300 300 293 280 280 275
head(sort(storm$FATALITIES, decreasing = TRUE), 50)
## [1] 583 158 116 114 99 90 75 74 67 57 57 50 49 46 44 42 42
## [18] 42 38 37 36 34 33 33 33 32 32 32 31 31 31 30 30 30
## [35] 29 29 29 27 27 27 26 25 25 25 25 25 24 24 24 24
Looking at the above, we can tell that injuries data are in most cases likely rough estimates – there are too many round numbers for it to be a coincidence – but that fatalities information is likely much more accurate.
We'll create a smaller dataframe consisting of just a few variables from the storm data set to make it easier to sort.
storm.health <- subset(storm, select = c(BGN_DATE, EVTYPE, INJURIES, FATALITIES))
head(storm.health)
## BGN_DATE EVTYPE INJURIES FATALITIES
## 1 4/18/1950 0:00:00 TORNADO 15 0
## 2 4/18/1950 0:00:00 TORNADO 0 0
## 3 2/20/1951 0:00:00 TORNADO 2 0
## 4 6/8/1951 0:00:00 TORNADO 2 0
## 5 11/15/1951 0:00:00 TORNADO 2 0
## 6 11/15/1951 0:00:00 TORNADO 6 0
Now we'll create two separate data frames, sorting on injuries and fatalities, respectively.
# Create df to sort by injuries
storm.health.injuries <- storm.health[ order(storm.health$INJURIES,
decreasing = TRUE), ]
head(storm.health.injuries, 20)
## BGN_DATE EVTYPE INJURIES FATALITIES
## 157885 4/10/1979 0:00:00 TORNADO 1700 42
## 223449 2/8/1994 0:00:00 ICE STORM 1568 1
## 67884 6/9/1953 0:00:00 TORNADO 1228 90
## 116011 4/3/1974 0:00:00 TORNADO 1150 36
## 862634 5/22/2011 0:00:00 TORNADO 1150 158
## 344159 10/17/1998 0:00:00 FLOOD 800 2
## 860386 4/27/2011 0:00:00 TORNADO 800 44
## 68670 6/8/1953 0:00:00 TORNADO 785 116
## 529351 8/13/2004 0:00:00 HURRICANE 780 7
## 344178 10/17/1998 0:00:00 FLOOD 750 0
## 858228 4/27/2011 0:00:00 TORNADO 700 20
## 344158 10/17/1998 0:00:00 FLOOD 600 11
## 148852 5/11/1953 0:00:00 TORNADO 597 114
## 35124 4/11/1965 0:00:00 TORNADO 560 17
## 344163 10/17/1998 0:00:00 FLOOD 550 0
## 667233 8/4/2007 0:00:00 HEAT 519 2
## 78567 3/3/1966 0:00:00 TORNADO 504 57
## 16503 10/3/1979 0:00:00 TORNADO 500 3
## 29566 4/21/1967 0:00:00 TORNADO 500 33
## 153749 5/11/1970 0:00:00 TORNADO 500 26
# Create df to sort by fatalities
storm.health.fatalities <- storm.health[ order(storm.health$FATALITIES,
decreasing = TRUE), ]
head(storm.health.fatalities, 20)
## BGN_DATE EVTYPE INJURIES FATALITIES
## 198704 7/12/1995 0:00:00 HEAT 0 583
## 862634 5/22/2011 0:00:00 TORNADO 1150 158
## 68670 6/8/1953 0:00:00 TORNADO 785 116
## 148852 5/11/1953 0:00:00 TORNADO 597 114
## 355128 7/28/1999 0:00:00 HEAT 0 99
## 67884 6/9/1953 0:00:00 TORNADO 1228 90
## 46309 5/25/1955 0:00:00 TORNADO 270 75
## 371112 7/4/1999 0:00:00 HEAT 135 74
## 230927 7/1/1995 0:00:00 HEAT 0 67
## 78567 3/3/1966 0:00:00 TORNADO 504 57
## 247938 7/13/1995 0:00:00 HEAT 0 57
## 6370 3/21/1952 0:00:00 TORNADO 325 50
## 598500 9/21/2005 0:00:00 HEAT 0 49
## 606363 7/16/2006 0:00:00 HEAT 18 46
## 860386 4/27/2011 0:00:00 TORNADO 800 44
## 157885 4/10/1979 0:00:00 TORNADO 1700 42
## 362850 7/18/1999 0:00:00 HEAT 397 42
## 629242 8/1/2006 0:00:00 HEAT 0 42
## 78123 12/5/1953 0:00:00 TORNADO 270 38
## 83578 5/20/1957 0:00:00 TORNADO 176 37
Now we're going to see what are the types of storms that have the most effect on human health, and order to get the top 6 to plot.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
stormsgrp <- group_by(storm, EVTYPE)
storm.inj.top <- summarise(stormsgrp, sum = sum(INJURIES),
median = median(INJURIES),
mean = mean(INJURIES),
n = n())
storm.inj.top <- storm.inj.top[ order(storm.inj.top$sum, decreasing = T), ]
head(storm.inj.top)
## Source: local data frame [6 x 5]
##
## EVTYPE sum median mean n
## 525 TORNADO 91407 0 1.5059 60701
## 537 TSTM WIND 9396 0 0.0280 335548
## 179 HEAT 9139 0 3.4775 2628
## 124 FLOOD 6880 0 0.2265 30380
## 292 LIGHTNING 5230 0 0.3320 15754
## 271 ICE STORM 1992 0 0.9803 2032
storm.fat.top <- summarise(stormsgrp, sum = sum(FATALITIES),
median = median(FATALITIES),
mean = mean(FATALITIES),
n = n())
storm.fat.top <- storm.fat.top[ order(storm.fat.top$sum, decreasing = T), ]
head(storm.fat.top)
## Source: local data frame [6 x 5]
##
## EVTYPE sum median mean n
## 525 TORNADO 5661 0 0.093260 60701
## 179 HEAT 3132 0 1.191781 2628
## 121 FLASH FLOOD 1035 0 0.018595 55661
## 292 LIGHTNING 816 0 0.051796 15754
## 537 TSTM WIND 723 0 0.002155 335548
## 124 FLOOD 516 0 0.016985 30380
We chose to order by total sum of injuries and fatalities, because this seems to be the best way of reporting total human toll over time and many events.
Now we create vectors with the top 6 event types for each type of human toll.
topinjuries <- head(storm.inj.top$EVTYPE, 6)
topfatalities <- head(storm.fat.top$EVTYPE, 6)
print(topinjuries)
## [1] TORNADO TSTM WIND HEAT FLOOD LIGHTNING ICE STORM
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
print(topfatalities)
## [1] TORNADO HEAT FLASH FLOOD LIGHTNING TSTM WIND FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
The above are the worst sorts of storms.
** At this point I had to quit my work due to a family emergency. I did not leave enough time to come back to it. I might have to re-take this class next month. Thanks for your time grading it and looking at it. **