This report describes an analysis to identify the most harmful types of storm events in the U.S. The NOAA storm data set was filtered and cleaned to suit the analysis, following the variable descriptions in the accompanying data documentation. An exploratory analysis found that tornadoes are the most harmful event type in terms of population health, while hurricanes have the greatest economic consequences.
The data for this assignment was downloaded from the course web site. Downloading is the first step in processing the data: the code checks whether the dataset is already in the working directory and downloads it only if it is missing. The dataset is then read using read.csv with header set to TRUE.
if (!file.exists("stormdata.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.bz2", mode = "wb")
}
stormdata <- read.csv("stormdata.bz2", header = TRUE)
The str function is used to inspect the variables and dimensions of the dataset.
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The dataset has 902297 observations of 37 variables. The data documentation helps us choose which variables are needed for the analysis. The variables EVTYPE, FATALITIES and INJURIES are used to find the types of events that are most harmful to population health. The variables PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP are used to find which types of events have the greatest economic consequences, since together these four variables give property and crop damage in dollars.
The dataset is subsetted in the next step, dropping all the variables not used in the analysis. Since REMARKS can include additional information on the events and REFNUM is the identifier of the event, these two variables are kept in the new dataset. To make the analysis faster, the rows are subsetted too: individual events with 0 in all of FATALITIES, INJURIES, PROPDMG and CROPDMG are excluded from the new data set. Since we want the most harmful event types, events with neither economic nor human harm are not relevant to this analysis.
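The damage amounts are stored as a number plus a separate magnitude code. A minimal sketch of combining the two into dollars is shown here on toy data. It assumes the documented codes K (thousands), M (millions) and B (billions); treating blank or unrecognised codes as a multiplier of 1 is an assumption of this sketch, and `to_dollars` is a hypothetical helper name that is not used elsewhere in the report.

```r
# Hypothetical helper: convert a damage value and its exponent code to dollars.
to_dollars <- function(value, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  # Look up each code case-insensitively; unknown or blank codes give NA.
  mult <- unname(multipliers[toupper(exp_code)])
  # Assumption: blank or unrecognised codes mean "no scaling".
  mult[is.na(mult)] <- 1
  value * mult
}

toy <- data.frame(PROPDMG = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

The same helper could be applied to CROPDMG and CROPDMGEXP; how to treat the less common codes in the real data is a decision this sketch leaves open.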
subset <- stormdata[,names(stormdata) %in% c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REMARKS", "REFNUM")]
subset <- subset[which(!(subset$FATALITIES==0 & subset$INJURIES==0 & subset$PROPDMG == 0.00 & subset$CROPDMG == 0)),]
dim(subset)
## [1] 254633 9
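The row filter above keeps every event with at least one non-zero harm measure. As a small illustration on toy data, the same selection can be written with a row sum, assuming the four columns are non-negative and contain no missing values (an assumption, not something checked in the report).

```r
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(0, 0, 0, 2),
  INJURIES   = c(15, 0, 0, 0),
  PROPDMG    = c(25, 0, 0, 0),
  CROPDMG    = c(0, 0, 5, 0),
  stringsAsFactors = FALSE
)
harm_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# Keep rows where at least one harm column is non-zero.
# Assumes the columns are non-negative and contain no missing values.
has_harm <- rowSums(toy[, harm_cols]) > 0
harmful <- toy[has_harm, ]
print(harmful$EVTYPE)
```

Only the HAIL row, which has zeros in all four columns, is dropped.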
Since we need to summarize the data for each EVTYPE it is important to know the unique values of this variable.
unique(subset$EVTYPE)
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "ICE STORM/FLASH FLOOD"
## [5] "WINTER STORM" "HURRICANE OPAL/HIGH WINDS"
## [7] "THUNDERSTORM WINDS" "HURRICANE ERIN"
## [9] "HURRICANE OPAL" "HEAVY RAIN"
## [11] "LIGHTNING" "THUNDERSTORM WIND"
## [13] "DENSE FOG" "RIP CURRENT"
## [15] "THUNDERSTORM WINS" "FLASH FLOODING"
## [17] "FLASH FLOOD" "TORNADO F0"
## [19] "THUNDERSTORM WINDS LIGHTNING" "THUNDERSTORM WINDS/HAIL"
## [21] "HEAT" "HIGH WINDS"
## [23] "WIND" "HEAVY RAINS"
## [25] "LIGHTNING AND HEAVY RAIN" "THUNDERSTORM WINDS HAIL"
## [27] "COLD" "HEAVY RAIN/LIGHTNING"
## [29] "FLASH FLOODING/THUNDERSTORM WI" "FLOODING"
## [31] "WATERSPOUT" "EXTREME COLD"
## [33] "LIGHTNING/HEAVY RAIN" "BREAKUP FLOODING"
## [35] "HIGH WIND" "FREEZE"
## [37] "RIVER FLOOD" "HIGH WINDS HEAVY RAINS"
## [39] "AVALANCHE" "MARINE MISHAP"
## [41] "HIGH TIDES" "HIGH WIND/SEAS"
## [43] "HIGH WINDS/HEAVY RAIN" "HIGH SEAS"
## [45] "COASTAL FLOOD" "SEVERE TURBULENCE"
## [47] "RECORD RAINFALL" "HEAVY SNOW"
## [49] "HEAVY SNOW/WIND" "DUST STORM"
## [51] "FLOOD" "APACHE COUNTY"
## [53] "SLEET" "DUST DEVIL"
## [55] "ICE STORM" "EXCESSIVE HEAT"
## [57] "THUNDERSTORM WINDS/FUNNEL CLOU" "GUSTY WINDS"
## [59] "FLOODING/HEAVY RAIN" "HEAVY SURF COASTAL FLOODING"
## [61] "HIGH SURF" "WILD FIRES"
## [63] "HIGH" "WINTER STORM HIGH WINDS"
## [65] "WINTER STORMS" "MUDSLIDES"
## [67] "RAINSTORM" "SEVERE THUNDERSTORM"
## [69] "SEVERE THUNDERSTORMS" "SEVERE THUNDERSTORM WINDS"
## [71] "THUNDERSTORMS WINDS" "FLOOD/FLASH FLOOD"
## [73] "FLOOD/RAIN/WINDS" "THUNDERSTORMS"
## [75] "FLASH FLOOD WINDS" "WINDS"
## [77] "FUNNEL CLOUD" "HIGH WIND DAMAGE"
## [79] "STRONG WIND" "HEAVY SNOWPACK"
## [81] "FLASH FLOOD/" "HEAVY SURF"
## [83] "DRY MIRCOBURST WINDS" "DRY MICROBURST"
## [85] "URBAN FLOOD" "THUNDERSTORM WINDSS"
## [87] "MICROBURST WINDS" "HEAT WAVE"
## [89] "UNSEASONABLY WARM" "COASTAL FLOODING"
## [91] "STRONG WINDS" "BLIZZARD"
## [93] "WATERSPOUT/TORNADO" "WATERSPOUT TORNADO"
## [95] "STORM SURGE" "URBAN/SMALL STREAM FLOOD"
## [97] "WATERSPOUT-" "TORNADOES, TSTM WIND, HAIL"
## [99] "TROPICAL STORM ALBERTO" "TROPICAL STORM"
## [101] "TROPICAL STORM GORDON" "TROPICAL STORM JERRY"
## [103] "LIGHTNING THUNDERSTORM WINDS" "URBAN FLOODING"
## [105] "MINOR FLOODING" "WATERSPOUT-TORNADO"
## [107] "LIGHTNING INJURY" "LIGHTNING AND THUNDERSTORM WIN"
## [109] "FLASH FLOODS" "THUNDERSTORM WINDS53"
## [111] "WILDFIRE" "DAMAGING FREEZE"
## [113] "THUNDERSTORM WINDS 13" "HURRICANE"
## [115] "SNOW" "LIGNTNING"
## [117] "FROST" "FREEZING RAIN/SNOW"
## [119] "HIGH WINDS/" "THUNDERSNOW"
## [121] "FLOODS" "COOL AND WET"
## [123] "HEAVY RAIN/SNOW" "GLAZE ICE"
## [125] "MUD SLIDE" "HIGH WINDS"
## [127] "RURAL FLOOD" "MUD SLIDES"
## [129] "EXTREME HEAT" "DROUGHT"
## [131] "COLD AND WET CONDITIONS" "EXCESSIVE WETNESS"
## [133] "SLEET/ICE STORM" "GUSTNADO"
## [135] "FREEZING RAIN" "SNOW AND HEAVY SNOW"
## [137] "GROUND BLIZZARD" "EXTREME WIND CHILL"
## [139] "MAJOR FLOOD" "SNOW/HEAVY SNOW"
## [141] "FREEZING RAIN/SLEET" "ICE JAM FLOODING"
## [143] "COLD AIR TORNADO" "WIND DAMAGE"
## [145] "FOG" "TSTM WIND 55"
## [147] "SMALL STREAM FLOOD" "THUNDERTORM WINDS"
## [149] "HAIL/WINDS" "SNOW AND ICE"
## [151] "WIND STORM" "GRASS FIRES"
## [153] "LAKE FLOOD" "HAIL/WIND"
## [155] "WIND/HAIL" "ICE"
## [157] "SNOW AND ICE STORM" "THUNDERSTORM WINDS"
## [159] "WINTER WEATHER" "DROUGHT/EXCESSIVE HEAT"
## [161] "THUNDERSTORMS WIND" "TUNDERSTORM WIND"
## [163] "URBAN AND SMALL STREAM FLOODIN" "THUNDERSTORM WIND/LIGHTNING"
## [165] "HEAVY RAIN/SEVERE WEATHER" "THUNDERSTORM"
## [167] "WATERSPOUT/ TORNADO" "LIGHTNING."
## [169] "HURRICANE-GENERATED SWELLS" "RIVER AND STREAM FLOOD"
## [171] "HIGH WINDS/COASTAL FLOOD" "RAIN"
## [173] "RIVER FLOODING" "ICE FLOES"
## [175] "THUNDERSTORM WIND G50" "LIGHTNING FIRE"
## [177] "HEAVY LAKE SNOW" "RECORD COLD"
## [179] "HEAVY SNOW/FREEZING RAIN" "COLD WAVE"
## [181] "DUST DEVIL WATERSPOUT" "TORNADO F3"
## [183] "TORNDAO" "FLOOD/RIVER FLOOD"
## [185] "MUD SLIDES URBAN FLOODING" "TORNADO F1"
## [187] "GLAZE/ICE STORM" "GLAZE"
## [189] "HEAVY SNOW/WINTER STORM" "MICROBURST"
## [191] "AVALANCE" "BLIZZARD/WINTER STORM"
## [193] "DUST STORM/HIGH WINDS" "ICE JAM"
## [195] "FOREST FIRES" "FROST\\FREEZE"
## [197] "THUNDERSTORM WINDS." "HVY RAIN"
## [199] "HAIL 150" "HAIL 075"
## [201] "HAIL 100" "THUNDERSTORM WIND G55"
## [203] "HAIL 125" "THUNDERSTORM WIND G60"
## [205] "THUNDERSTORM WINDS G60" "HARD FREEZE"
## [207] "HAIL 200" "HEAVY SNOW AND HIGH WINDS"
## [209] "HEAVY SNOW/HIGH WINDS & FLOOD" "HEAVY RAIN AND FLOOD"
## [211] "RIP CURRENTS/HEAVY SURF" "URBAN AND SMALL"
## [213] "WILDFIRES" "FOG AND COLD TEMPERATURES"
## [215] "SNOW/COLD" "FLASH FLOOD FROM ICE JAMS"
## [217] "TSTM WIND G58" "MUDSLIDE"
## [219] "HEAVY SNOW SQUALLS" "SNOW SQUALL"
## [221] "SNOW/ICE STORM" "HEAVY SNOW/SQUALLS"
## [223] "HEAVY SNOW-SQUALLS" "ICY ROADS"
## [225] "HEAVY MIX" "SNOW FREEZING RAIN"
## [227] "SNOW/SLEET" "SNOW/FREEZING RAIN"
## [229] "SNOW SQUALLS" "SNOW/SLEET/FREEZING RAIN"
## [231] "RECORD SNOW" "HAIL 0.75"
## [233] "RECORD HEAT" "THUNDERSTORM WIND 65MPH"
## [235] "THUNDERSTORM WIND/ TREES" "THUNDERSTORM WIND/AWNING"
## [237] "THUNDERSTORM WIND 98 MPH" "THUNDERSTORM WIND TREES"
## [239] "TORNADO F2" "RIP CURRENTS"
## [241] "HURRICANE EMILY" "COASTAL SURGE"
## [243] "HURRICANE GORDON" "HURRICANE FELIX"
## [245] "THUNDERSTORM WIND 60 MPH" "THUNDERSTORM WINDS 63 MPH"
## [247] "THUNDERSTORM WIND/ TREE" "THUNDERSTORM DAMAGE TO"
## [249] "THUNDERSTORM WIND 65 MPH" "FLASH FLOOD - HEAVY RAIN"
## [251] "THUNDERSTORM WIND." "FLASH FLOOD/ STREET"
## [253] "BLOWING SNOW" "HEAVY SNOW/BLIZZARD"
## [255] "THUNDERSTORM HAIL" "THUNDERSTORM WINDSHAIL"
## [257] "LIGHTNING WAUSEON" "THUDERSTORM WINDS"
## [259] "ICE AND SNOW" "STORM FORCE WINDS"
## [261] "HEAVY SNOW/ICE" "LIGHTING"
## [263] "HIGH WIND/HEAVY SNOW" "THUNDERSTORM WINDS AND"
## [265] "HEAVY PRECIPITATION" "HIGH WIND/BLIZZARD"
## [267] "TSTM WIND DAMAGE" "FLOOD FLASH"
## [269] "RAIN/WIND" "SNOW/ICE"
## [271] "HAIL 75" "HEAT WAVE DROUGHT"
## [273] "HEAVY SNOW/BLIZZARD/AVALANCHE" "HEAT WAVES"
## [275] "UNSEASONABLY WARM AND DRY" "UNSEASONABLY COLD"
## [277] "RECORD/EXCESSIVE HEAT" "THUNDERSTORM WIND G52"
## [279] "HIGH WAVES" "FLASH FLOOD/FLOOD"
## [281] "FLOOD/FLASH" "LOW TEMPERATURE"
## [283] "HEAVY RAINS/FLOODING" "THUNDERESTORM WINDS"
## [285] "THUNDERSTORM WINDS/FLOODING" "HYPOTHERMIA"
## [287] "THUNDEERSTORM WINDS" "THUNERSTORM WINDS"
## [289] "HIGH WINDS/COLD" "COLD/WINDS"
## [291] "SNOW/ BITTER COLD" "COLD WEATHER"
## [293] "RAPIDLY RISING WATER" "WILD/FOREST FIRE"
## [295] "ICE/STRONG WINDS" "SNOW/HIGH WINDS"
## [297] "HIGH WINDS/SNOW" "SNOWMELT FLOODING"
## [299] "HEAVY SNOW AND STRONG WINDS" "SNOW ACCUMULATION"
## [301] "SNOW/ ICE" "SNOW/BLOWING SNOW"
## [303] "TORNADOES" "THUNDERSTORM WIND/HAIL"
## [305] "FREEZING DRIZZLE" "HAIL 175"
## [307] "FLASH FLOODING/FLOOD" "HAIL 275"
## [309] "HAIL 450" "EXCESSIVE RAINFALL"
## [311] "THUNDERSTORMW" "HAILSTORM"
## [313] "TSTM WINDS" "TSTMW"
## [315] "TSTM WIND 65)" "TROPICAL STORM DEAN"
## [317] "THUNDERSTORM WINDS/ FLOOD" "LANDSLIDE"
## [319] "HIGH WIND AND SEAS" "THUNDERSTORMWINDS"
## [321] "WILD/FOREST FIRES" "HEAVY SEAS"
## [323] "HAIL DAMAGE" "FLOOD & HEAVY RAIN"
## [325] "?" "THUNDERSTROM WIND"
## [327] "FLOOD/FLASHFLOOD" "HIGH WATER"
## [329] "HIGH WIND 48" "LANDSLIDES"
## [331] "URBAN/SMALL STREAM" "BRUSH FIRE"
## [333] "HEAVY SHOWER" "HEAVY SWELLS"
## [335] "URBAN SMALL" "URBAN FLOODS"
## [337] "FLASH FLOOD/LANDSLIDE" "HEAVY RAIN/SMALL STREAM URBAN"
## [339] "FLASH FLOOD LANDSLIDES" "TSTM WIND/HAIL"
## [341] "Other" "Ice jam flood (minor"
## [343] "Tstm Wind" "URBAN/SML STREAM FLD"
## [345] "ROUGH SURF" "Heavy Surf"
## [347] "Dust Devil" "Marine Accident"
## [349] "Freeze" "Strong Wind"
## [351] "COASTAL STORM" "Erosion/Cstl Flood"
## [353] "River Flooding" "Damaging Freeze"
## [355] "Beach Erosion" "High Surf"
## [357] "Heavy Rain/High Surf" "Unseasonable Cold"
## [359] "Early Frost" "Wintry Mix"
## [361] "Extreme Cold" "Coastal Flooding"
## [363] "Torrential Rainfall" "Landslump"
## [365] "Hurricane Edouard" "Coastal Storm"
## [367] "TIDAL FLOODING" "Tidal Flooding"
## [369] "Strong Winds" "EXTREME WINDCHILL"
## [371] "Glaze" "Extended Cold"
## [373] "Whirlwind" "Heavy snow shower"
## [375] "Light snow" "Light Snow"
## [377] "MIXED PRECIP" "Freezing Spray"
## [379] "DOWNBURST" "Mudslides"
## [381] "Microburst" "Mudslide"
## [383] "Cold" "Coastal Flood"
## [385] "Snow Squalls" "Wind Damage"
## [387] "Light Snowfall" "Freezing Drizzle"
## [389] "Gusty wind/rain" "GUSTY WIND/HVY RAIN"
## [391] "Wind" "Cold Temperature"
## [393] "Heat Wave" "Snow"
## [395] "COLD AND SNOW" "RAIN/SNOW"
## [397] "TSTM WIND (G45)" "Gusty Winds"
## [399] "GUSTY WIND" "TSTM WIND 40"
## [401] "TSTM WIND 45" "TSTM WIND (41)"
## [403] "TSTM WIND (G40)" "Frost/Freeze"
## [405] "AGRICULTURAL FREEZE" "OTHER"
## [407] "Hypothermia/Exposure" "HYPOTHERMIA/EXPOSURE"
## [409] "Lake Effect Snow" "Freezing Rain"
## [411] "Mixed Precipitation" "BLACK ICE"
## [413] "COASTALSTORM" "LIGHT SNOW"
## [415] "DAM BREAK" "Gusty winds"
## [417] "blowing snow" "GRADIENT WIND"
## [419] "TSTM WIND AND LIGHTNING" "gradient wind"
## [421] "Gradient wind" "Freezing drizzle"
## [423] "WET MICROBURST" "Heavy surf and wind"
## [425] "TYPHOON" "HIGH SWELLS"
## [427] "SMALL HAIL" "UNSEASONAL RAIN"
## [429] "COASTAL FLOODING/EROSION" " TSTM WIND (G45)"
## [431] "TSTM WIND (G45)" "HIGH WIND (G40)"
## [433] "TSTM WIND (G35)" "COASTAL EROSION"
## [435] "SEICHE" "COASTAL FLOODING/EROSION"
## [437] "HYPERTHERMIA/EXPOSURE" "WINTRY MIX"
## [439] "ROCK SLIDE" "GUSTY WIND/HAIL"
## [441] " TSTM WIND" "LANDSPOUT"
## [443] "EXCESSIVE SNOW" "LAKE EFFECT SNOW"
## [445] "FLOOD/FLASH/FLOOD" "MIXED PRECIPITATION"
## [447] "WIND AND WAVE" "LIGHT FREEZING RAIN"
## [449] "ICE ROADS" "ROUGH SEAS"
## [451] "TSTM WIND G45" "NON-SEVERE WIND DAMAGE"
## [453] "WARM WEATHER" "THUNDERSTORM WIND (G40)"
## [455] " FLASH FLOOD" "LATE SEASON SNOW"
## [457] "WINTER WEATHER MIX" "ROGUE WAVE"
## [459] "FALLING SNOW/ICE" "NON-TSTM WIND"
## [461] "NON TSTM WIND" "BLOWING DUST"
## [463] "VOLCANIC ASH" " HIGH SURF ADVISORY"
## [465] "HAZARDOUS SURF" "WHIRLWIND"
## [467] "ICE ON ROAD" "DROWNING"
## [469] "EXTREME COLD/WIND CHILL" "MARINE TSTM WIND"
## [471] "HURRICANE/TYPHOON" "WINTER WEATHER/MIX"
## [473] "FROST/FREEZE" "ASTRONOMICAL HIGH TIDE"
## [475] "HEAVY SURF/HIGH SURF" "TROPICAL DEPRESSION"
## [477] "LAKE-EFFECT SNOW" "MARINE HIGH WIND"
## [479] "TSUNAMI" "STORM SURGE/TIDE"
## [481] "COLD/WIND CHILL" "LAKESHORE FLOOD"
## [483] "MARINE THUNDERSTORM WIND" "MARINE STRONG WIND"
## [485] "ASTRONOMICAL LOW TIDE" "DENSE SMOKE"
## [487] "MARINE HAIL" "FREEZING FOG"
length(unique(subset$EVTYPE))
## [1] 488
As the previous results show, there are 488 unique values in EVTYPE, while the data documentation lists only 48 event types. The data has to be cleaned, since some event types are repeated with typos or in lower case and others can be classified as one of the 48 types described in the documentation.
Here all values of EVTYPE are converted to upper case.
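Before cleaning, it can help to see which raw labels account for most events, so that effort goes to the variants that matter. The following is a small sketch on a toy vector; `raw_types` stands in for the EVTYPE column and its values are illustrative only.

```r
# Toy stand-in for the EVTYPE column; the values are illustrative only.
raw_types <- c("TSTM WIND", "TSTM WIND", "Tstm Wind", "HAIL", "HAIL",
               "HAIL", "TORNADO", "TORNDAO")

# Count events per raw label, most frequent first.
counts <- sort(table(raw_types), decreasing = TRUE)
print(counts)

# Share of all events covered by the two most common labels.
top_share <- sum(counts[1:2]) / length(raw_types)
print(top_share)
```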
subset$EVTYPE <- toupper(subset$EVTYPE)
length(unique(subset$EVTYPE))
## [1] 447
Applying the toupper function reduces the number of unique values to 447. The event types from the data documentation are added manually in the next code chunk so they can be matched against the EVTYPE variable.
events <- "Astronomical Low Tide,Avalanche,Blizzard,Coastal Flood,Cold/Wind Chill,Debris Flow,Dense Fog,Dense Smoke,Drought,Dust Devil,Dust Storm,Excessive Heat,Extreme Cold/Wind Chill,Flash Flood,Flood,Frost/Freeze,Funnel Cloud,Freezing Fog,Hail,Heat,Heavy Rain,Heavy Snow,High Surf,High Wind,Hurricane (Typhoon),Ice Storm,Lake-Effect Snow,Lakeshore Flood,Lightning,Marine Hail,Marine High Wind,Marine Strong Wind,Marine Thunderstorm Wind,Rip Current,Seiche,Sleet,Storm Surge/Tide,Strong Wind,Thunderstorm Wind,Tornado,Tropical Depression,Tropical Storm,Tsunami,Volcanic Ash,Waterspout,Wildfire,Winter Storm,Winter Weather"
events <- strsplit(events, ",")
evtype_doc <- toupper(events[[1]])
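Besides letter case, the raw labels also differ in whitespace: the listing of unique values includes entries such as " TSTM WIND" with a leading space, and some labels appear twice, which suggests (the rendered output cannot confirm it) that they differ only in spacing. A minimal sketch of normalising whitespace alongside case, on toy values; the double space in "HIGH  WINDS" is an assumed example.

```r
# Toy values modelled on the kinds of variants seen in the listing.
raw <- c(" TSTM WIND", "TSTM WIND", "Tstm Wind", "HIGH  WINDS", "HIGH WINDS")

# Upper-case, trim leading/trailing blanks, and collapse runs of spaces.
clean <- toupper(raw)
clean <- trimws(clean)
clean <- gsub(" +", " ", clean)

print(unique(clean))
```

Whether this step changes the counts reported later would need to be checked against the real data.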
In the next code chunk we create a new function. It compares each string in evtype_doc with the values of the EVTYPE variable using the function amatch and builds a list with 48 elements, each holding the EVTYPE values matched to one of the 48 event types described in the documentation. This list lets us inspect the matches and decide which changes to make in the dataset. The function is then called with a maxdist value of 2.
library(stringdist)  # provides amatch()

# For each official event type, list the EVTYPE values within maxdist edits of it.
mat_groups <- function(maxdist){
  evtype_groups <- list()
  types <- unique(subset$EVTYPE)
  for (i in evtype_doc){
    matched <- types[!is.na(amatch(types, i, maxDist = maxdist))]
    evtype_groups[[length(evtype_groups)+1]] <- list(matched)
  }
  evtype_groups
}
mat_groups(2)
## [[1]]
## [[1]][[1]]
## [1] "ASTRONOMICAL LOW TIDE"
##
##
## [[2]]
## [[2]][[1]]
## [1] "AVALANCHE" "AVALANCE"
##
##
## [[3]]
## [[3]][[1]]
## [1] "BLIZZARD"
##
##
## [[4]]
## [[4]][[1]]
## [1] "COASTAL FLOOD"
##
##
## [[5]]
## [[5]][[1]]
## [1] "COLD/WIND CHILL"
##
##
## [[6]]
## [[6]][[1]]
## character(0)
##
##
## [[7]]
## [[7]][[1]]
## [1] "DENSE FOG"
##
##
## [[8]]
## [[8]][[1]]
## [1] "DENSE SMOKE"
##
##
## [[9]]
## [[9]][[1]]
## [1] "DROUGHT"
##
##
## [[10]]
## [[10]][[1]]
## [1] "DUST DEVIL"
##
##
## [[11]]
## [[11]][[1]]
## [1] "DUST STORM"
##
##
## [[12]]
## [[12]][[1]]
## [1] "EXCESSIVE HEAT"
##
##
## [[13]]
## [[13]][[1]]
## [1] "EXTREME COLD/WIND CHILL"
##
##
## [[14]]
## [[14]][[1]]
## [1] "FLASH FLOOD" "FLASH FLOOD/" "FLASH FLOODS" " FLASH FLOOD"
##
##
## [[15]]
## [[15]][[1]]
## [1] "FLOOD" "FLOODS"
##
##
## [[16]]
## [[16]][[1]]
## [1] "FROST\\FREEZE" "FROST/FREEZE"
##
##
## [[17]]
## [[17]][[1]]
## [1] "FUNNEL CLOUD"
##
##
## [[18]]
## [[18]][[1]]
## [1] "FREEZING FOG"
##
##
## [[19]]
## [[19]][[1]]
## [1] "HAIL" "RAIN"
##
##
## [[20]]
## [[20]][[1]]
## [1] "HEAT"
##
##
## [[21]]
## [[21]][[1]]
## [1] "HEAVY RAIN" "HEAVY RAINS" "HVY RAIN"
##
##
## [[22]]
## [[22]][[1]]
## [1] "HEAVY SNOW"
##
##
## [[23]]
## [[23]][[1]]
## [1] "HIGH SURF"
##
##
## [[24]]
## [[24]][[1]]
## [1] "HIGH WINDS" "HIGH WIND" "HIGH WINDS/" "HIGH WINDS"
##
##
## [[25]]
## [[25]][[1]]
## character(0)
##
##
## [[26]]
## [[26]][[1]]
## [1] "ICE STORM"
##
##
## [[27]]
## [[27]][[1]]
## [1] "LAKE EFFECT SNOW" "LAKE-EFFECT SNOW"
##
##
## [[28]]
## [[28]][[1]]
## [1] "LAKESHORE FLOOD"
##
##
## [[29]]
## [[29]][[1]]
## [1] "LIGHTNING" "LIGNTNING" "LIGHTNING." "LIGHTING"
##
##
## [[30]]
## [[30]][[1]]
## [1] "MARINE HAIL"
##
##
## [[31]]
## [[31]][[1]]
## [1] "MARINE HIGH WIND"
##
##
## [[32]]
## [[32]][[1]]
## [1] "MARINE STRONG WIND"
##
##
## [[33]]
## [[33]][[1]]
## [1] "MARINE THUNDERSTORM WIND"
##
##
## [[34]]
## [[34]][[1]]
## [1] "RIP CURRENT" "RIP CURRENTS"
##
##
## [[35]]
## [[35]][[1]]
## [1] "SEICHE"
##
##
## [[36]]
## [[36]][[1]]
## [1] "SLEET"
##
##
## [[37]]
## [[37]][[1]]
## [1] "STORM SURGE/TIDE"
##
##
## [[38]]
## [[38]][[1]]
## [1] "STRONG WIND" "STRONG WINDS"
##
##
## [[39]]
## [[39]][[1]]
## [1] "THUNDERSTORM WINDS" "THUNDERSTORM WIND" "THUNDERSTORM WINS"
## [4] "THUNDERSTORMS WINDS" "THUNDERSTORM WINDSS" "THUNDERTORM WINDS"
## [7] "THUNDERSTORM WINDS" "THUNDERSTORMS WIND" "TUNDERSTORM WIND"
## [10] "THUNDERSTORM WINDS." "THUNDERSTORM WIND." "THUDERSTORM WINDS"
## [13] "THUNDERESTORM WINDS" "THUNDEERSTORM WINDS" "THUNERSTORM WINDS"
## [16] "THUNDERSTORMWINDS" "THUNDERSTROM WIND"
##
##
## [[40]]
## [[40]][[1]]
## [1] "TORNADO" "TORNDAO" "TORNADOES"
##
##
## [[41]]
## [[41]][[1]]
## [1] "TROPICAL DEPRESSION"
##
##
## [[42]]
## [[42]][[1]]
## [1] "TROPICAL STORM"
##
##
## [[43]]
## [[43]][[1]]
## [1] "TSUNAMI"
##
##
## [[44]]
## [[44]][[1]]
## [1] "VOLCANIC ASH"
##
##
## [[45]]
## [[45]][[1]]
## [1] "WATERSPOUT" "WATERSPOUT-"
##
##
## [[46]]
## [[46]][[1]]
## [1] "WILD FIRES" "WILDFIRE" "WILDFIRES"
##
##
## [[47]]
## [[47]][[1]]
## [1] "WINTER STORM" "WINTER STORMS"
##
##
## [[48]]
## [[48]][[1]]
## [1] "WINTER WEATHER"
All the matches seem to be right except for number 19 in the list, where “RAIN” is matched to “HAIL”: this is not a typo but two different event types.
The function created in the next code chunk takes as its first argument a vector giving the positions of the elements in the previously generated list that are correctly matched. The second argument, dist, is the value of maxDist to use. The function finds all the matches with amatch, which returns a vector of the same length as EVTYPE whose values are the index in evtype_doc matched to each EVTYPE string. If this index is in the vector given as argument, the function changes the EVTYPE value to the corresponding value in evtype_doc.
The function is called with the values 2, 14, 15, 16, 21, 24, 27, 29, 34, 38, 39, 40, 45, 46 and 47 in the vector argument; these numbers are the positions in evtype_doc that are correctly matched using a maxDist of 2.
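The HAIL/RAIN mismatch follows directly from how edit distance works: the two words differ in only two letters. The sketch below uses base R's adist(), which computes a Levenshtein distance, to show this on a few pairs from the output. Note the swap: amatch defaults to the OSA distance, which additionally counts a swap of two adjacent letters as a single edit, so its value for "TORNDAO" would be 1 rather than the 2 shown here.

```r
# Base R edit distance (generalized Levenshtein); no extra packages needed.
pairs <- data.frame(
  candidate = c("AVALANCE", "LIGNTNING", "RAIN", "TORNDAO"),
  official  = c("AVALANCHE", "LIGHTNING", "HAIL", "TORNADO"),
  stringsAsFactors = FALSE
)

# adist() returns a matrix; mapply extracts the single distance per pair.
pairs$distance <- mapply(function(a, b) adist(a, b)[1, 1],
                         pairs$candidate, pairs$official)
print(pairs)
```

A genuine typo ("AVALANCE") and a different word ("RAIN") can both sit within a distance of 2, which is why the matches are reviewed by eye before any replacement.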
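replace_match <- function(vector, dist){
  # Index in evtype_doc of the closest official name for every row (NA if none).
  indices <- amatch(subset$EVTYPE, evtype_doc, maxDist = dist)
  # Rows whose match is one of the positions approved in 'vector'.
  hit <- !is.na(indices) & indices %in% vector
  # Update the global data frame in place, as the calls below expect.
  subset$EVTYPE[hit] <<- evtype_doc[indices[hit]]
}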
replace_match(c(2, 14, 15, 16, 21, 24, 27, 29, 34, 38, 39, 40, 45, 46, 47), 2)
length(unique(subset$EVTYPE))
## [1] 408
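The replacement step above modifies the global data frame from inside the function. As a sketch of an alternative design, the same idea can be written as a function that returns the cleaned vector instead. Everything here is hypothetical: `normalize_types` and its arguments are illustrative names, and base adist() (Levenshtein) stands in for amatch so the example runs without extra packages.

```r
# Hypothetical helper: replace a value with its closest official name when
# that name is at most max_dist edits away AND its position is in 'allowed'.
# Uses base adist() (Levenshtein), not amatch().
normalize_types <- function(x, official, allowed, max_dist) {
  values <- unique(x)
  d <- adist(values, official)            # rows: values, cols: official names
  nearest <- apply(d, 1, which.min)       # index of the closest official name
  nearest_dist <- d[cbind(seq_along(values), nearest)]
  ok <- nearest_dist <= max_dist & nearest %in% allowed
  lookup <- values
  lookup[ok] <- official[nearest[ok]]
  # Translate every element of x through the per-value lookup table.
  lookup[match(x, values)]
}

official <- c("AVALANCHE", "HAIL", "LIGHTNING")
x <- c("AVALANCE", "RAIN", "LIGNTNING", "HAIL", "AVALANCE")
# Position 2 (HAIL) is left out of 'allowed', so RAIN stays untouched.
cleaned <- normalize_types(x, official, allowed = c(1, 3), max_dist = 2)
print(cleaned)
```

Working on the unique values first keeps the distance matrix small even when the column has hundreds of thousands of rows.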
After replacing all the matching values, 408 unique values remain. The same process is applied once more with a larger maxDist value in the amatch function.
mat_groups(3)
## [[1]]
## [[1]][[1]]
## [1] "ASTRONOMICAL LOW TIDE"
##
##
## [[2]]
## [[2]][[1]]
## [1] "AVALANCHE"
##
##
## [[3]]
## [[3]][[1]]
## [1] "BLIZZARD"
##
##
## [[4]]
## [[4]][[1]]
## [1] "COASTAL FLOOD" "COASTAL FLOODING"
##
##
## [[5]]
## [[5]][[1]]
## [1] "COLD/WIND CHILL"
##
##
## [[6]]
## [[6]][[1]]
## character(0)
##
##
## [[7]]
## [[7]][[1]]
## [1] "DENSE FOG"
##
##
## [[8]]
## [[8]][[1]]
## [1] "DENSE SMOKE"
##
##
## [[9]]
## [[9]][[1]]
## [1] "DROUGHT"
##
##
## [[10]]
## [[10]][[1]]
## [1] "DUST DEVIL"
##
##
## [[11]]
## [[11]][[1]]
## [1] "DUST STORM"
##
##
## [[12]]
## [[12]][[1]]
## [1] "EXCESSIVE HEAT"
##
##
## [[13]]
## [[13]][[1]]
## [1] "EXTREME COLD/WIND CHILL"
##
##
## [[14]]
## [[14]][[1]]
## [1] "FLASH FLOODING" "FLASH FLOOD" "LAKE FLOOD"
##
##
## [[15]]
## [[15]][[1]]
## [1] "COLD" "FLOODING" "FLOOD" "FROST" "FOG"
##
##
## [[16]]
## [[16]][[1]]
## [1] "FROST/FREEZE"
##
##
## [[17]]
## [[17]][[1]]
## [1] "FUNNEL CLOUD"
##
##
## [[18]]
## [[18]][[1]]
## [1] "FREEZING FOG"
##
##
## [[19]]
## [[19]][[1]]
## [1] "HAIL" "HEAT" "HIGH" "RAIN" "HAIL 75"
##
##
## [[20]]
## [[20]][[1]]
## [1] "HAIL" "HEAT" "SLEET" "HIGH"
##
##
## [[21]]
## [[21]][[1]]
## [1] "HEAVY RAIN" "HEAVY MIX"
##
##
## [[22]]
## [[22]][[1]]
## [1] "HEAVY SNOW" "HEAVY SURF" "HEAVY SEAS" "HEAVY SHOWER"
##
##
## [[23]]
## [[23]][[1]]
## [1] "HIGH SEAS" "HIGH SURF" "ROUGH SURF"
##
##
## [[24]]
## [[24]][[1]]
## [1] "HIGH WIND" "HIGH WIND 48"
##
##
## [[25]]
## [[25]][[1]]
## [1] "HURRICANE/TYPHOON"
##
##
## [[26]]
## [[26]][[1]]
## [1] "ICE STORM" "WIND STORM"
##
##
## [[27]]
## [[27]][[1]]
## [1] "LAKE-EFFECT SNOW"
##
##
## [[28]]
## [[28]][[1]]
## [1] "LAKESHORE FLOOD"
##
##
## [[29]]
## [[29]][[1]]
## [1] "LIGHTNING"
##
##
## [[30]]
## [[30]][[1]]
## [1] "MARINE HAIL"
##
##
## [[31]]
## [[31]][[1]]
## [1] "MARINE HIGH WIND"
##
##
## [[32]]
## [[32]][[1]]
## [1] "MARINE STRONG WIND"
##
##
## [[33]]
## [[33]][[1]]
## [1] "MARINE THUNDERSTORM WIND"
##
##
## [[34]]
## [[34]][[1]]
## [1] "RIP CURRENT"
##
##
## [[35]]
## [[35]][[1]]
## [1] "ICE" "SEICHE"
##
##
## [[36]]
## [[36]][[1]]
## [1] "HEAT" "SLEET"
##
##
## [[37]]
## [[37]][[1]]
## [1] "STORM SURGE/TIDE"
##
##
## [[38]]
## [[38]][[1]]
## [1] "STRONG WIND"
##
##
## [[39]]
## [[39]][[1]]
## [1] "THUNDERSTORM WIND" "THUNDERSTORM WINDS53"
##
##
## [[40]]
## [[40]][[1]]
## [1] "TORNADO" "TORNADO F0" "TORNADO F3" "TORNADO F1" "TORNADO F2"
##
##
## [[41]]
## [[41]][[1]]
## [1] "TROPICAL DEPRESSION"
##
##
## [[42]]
## [[42]][[1]]
## [1] "TROPICAL STORM"
##
##
## [[43]]
## [[43]][[1]]
## [1] "TSUNAMI"
##
##
## [[44]]
## [[44]][[1]]
## [1] "VOLCANIC ASH"
##
##
## [[45]]
## [[45]][[1]]
## [1] "WATERSPOUT"
##
##
## [[46]]
## [[46]][[1]]
## [1] "WILDFIRE"
##
##
## [[47]]
## [[47]][[1]]
## [1] "WINTER STORM" "WIND STORM"
##
##
## [[48]]
## [[48]][[1]]
## [1] "WINTER WEATHER"
subset$EVTYPE <- gsub("LAKE FLOOD", "LAKESHORE FLOOD", subset$EVTYPE)
replace_match(c(4,14,23,24,25,39,40), 3)
length(unique(subset$EVTYPE))
## [1] 397
The mat_groups output shows “LAKE FLOOD” matched to FLASH FLOOD, which is a mismatch, so all values with that event type are renamed to “LAKESHORE FLOOD”, a valid type according to the documentation, before replace_match is called.
There are still 397 unique values. A common variant is “FLOODING”, so that substring is replaced with “FLOOD”.
unique(subset$EVTYPE)[grep("FLOODING", unique(subset$EVTYPE))]
## [1] "FLASH FLOODING/THUNDERSTORM WI" "FLOODING"
## [3] "BREAKUP FLOODING" "FLOODING/HEAVY RAIN"
## [5] "HEAVY SURF COASTAL FLOODING" "URBAN FLOODING"
## [7] "MINOR FLOODING" "ICE JAM FLOODING"
## [9] "RIVER FLOODING" "MUD SLIDES URBAN FLOODING"
## [11] "HEAVY RAINS/FLOODING" "THUNDERSTORM WINDS/FLOODING"
## [13] "SNOWMELT FLOODING" "FLASH FLOODING/FLOOD"
## [15] "TIDAL FLOODING" "COASTAL FLOODING/EROSION"
## [17] "COASTAL FLOODING/EROSION"
subset$EVTYPE <- gsub("FLOODING", "FLOOD", subset$EVTYPE)
length(unique(subset$EVTYPE))
## [1] 393
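To make the effect of this substitution concrete, here is a small sketch on a few labels taken from the listings above. Using fixed = TRUE is a choice made for this sketch so the pattern is treated as a literal string.

```r
types <- c("FLASH FLOODING", "RIVER FLOODING", "COASTAL FLOODING/EROSION",
           "URBAN AND SMALL STREAM FLOODIN", "FLOOD")

# Replace the literal substring; fixed = TRUE avoids regex interpretation.
fixed_types <- gsub("FLOODING", "FLOOD", types, fixed = TRUE)
print(fixed_types)

# The truncated label is unchanged because it lacks the final "G".
print(fixed_types[4] == types[4])
```

Truncated labels such as "URBAN AND SMALL STREAM FLOODIN" are not touched by this step and need separate handling.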
The function mat_groups is applied once more using 4 as maxDist.
mat_groups(4)
## [[1]]
## [[1]][[1]]
## [1] "ASTRONOMICAL HIGH TIDE" "ASTRONOMICAL LOW TIDE"
##
##
## [[2]]
## [[2]][[1]]
## [1] "AVALANCHE"
##
##
## [[3]]
## [[3]][[1]]
## [1] "BLIZZARD"
##
##
## [[4]]
## [[4]][[1]]
## [1] "COASTAL FLOOD" "COASTAL STORM"
##
##
## [[5]]
## [[5]][[1]]
## [1] "COLD/WIND CHILL"
##
##
## [[6]]
## [[6]][[1]]
## character(0)
##
##
## [[7]]
## [[7]][[1]]
## [1] "DENSE FOG" "DENSE SMOKE"
##
##
## [[8]]
## [[8]][[1]]
## [1] "DENSE FOG" "DENSE SMOKE"
##
##
## [[9]]
## [[9]][[1]]
## [1] "FROST" "DROUGHT"
##
##
## [[10]]
## [[10]][[1]]
## [1] "DUST DEVIL"
##
##
## [[11]]
## [[11]][[1]]
## [1] "DUST STORM" "ICE STORM" "WIND STORM"
##
##
## [[12]]
## [[12]][[1]]
## [1] "EXCESSIVE HEAT" "EXCESSIVE SNOW"
##
##
## [[13]]
## [[13]][[1]]
## [1] "EXTREME COLD/WIND CHILL"
##
##
## [[14]]
## [[14]][[1]]
## [1] "FLASH FLOOD"
##
##
## [[15]]
## [[15]][[1]]
## [1] "WIND" "COLD" "FLOOD" "SLEET" "SNOW" "FROST" "FOG" "GLAZE"
##
##
## [[16]]
## [[16]][[1]]
## [1] "FROST/FREEZE"
##
##
## [[17]]
## [[17]][[1]]
## [1] "FUNNEL CLOUD"
##
##
## [[18]]
## [[18]][[1]]
## [1] "FREEZING RAIN" "FREEZING FOG"
##
##
## [[19]]
## [[19]][[1]]
## [1] "HAIL" "HEAT" "WIND" "COLD" "HIGH" "SNOW"
## [7] "FOG" "ICE" "RAIN" "GLAZE" "HAIL 150" "HAIL 075"
## [13] "HAIL 100" "HAIL 125" "HAIL 200" "HAIL 75" "HAIL 175" "HAIL 275"
## [19] "HAIL 450" "?"
##
##
## [[20]]
## [[20]][[1]]
## [1] "HAIL" "HEAT" "WIND" "COLD" "SLEET" "HIGH" "SNOW" "FROST" "FOG"
## [10] "ICE" "RAIN" "GLAZE" "?" "OTHER"
##
##
## [[21]]
## [[21]][[1]]
## [1] "HEAVY RAIN" "HEAVY SNOW" "HEAVY SURF" "HEAVY MIX" "HEAVY SEAS"
##
##
## [[22]]
## [[22]][[1]]
## [1] "HEAVY RAIN" "HEAVY SNOW" "HEAVY SNOWPACK" "HEAVY SURF"
## [5] "HEAVY MIX" "HEAVY SNOW/ICE" "HEAVY SEAS" "HEAVY SHOWER"
##
##
## [[23]]
## [[23]][[1]]
## [1] "HIGH WIND" "HIGH SURF" "HEAVY SURF"
##
##
## [[24]]
## [[24]][[1]]
## [1] "TSTM WIND" "LIGHTNING" "HIGH WIND" "HIGH TIDES" "HIGH SURF"
## [6] "HAIL/WIND" "HIGH WAVES" "HIGH WATER" "WHIRLWIND"
##
##
## [[25]]
## [[25]][[1]]
## [1] "HURRICANE (TYPHOON)"
##
##
## [[26]]
## [[26]][[1]]
## [1] "WINTER STORM" "DUST STORM" "ICE STORM" "RAINSTORM" "WIND STORM"
## [6] "ICE FLOES" "ICE JAM" "HAILSTORM"
##
##
## [[27]]
## [[27]][[1]]
## [1] "LAKE-EFFECT SNOW"
##
##
## [[28]]
## [[28]][[1]]
## [1] "LAKESHORE FLOOD"
##
##
## [[29]]
## [[29]][[1]]
## [1] "LIGHTNING" "HIGH WIND" "LIGHT SNOW"
##
##
## [[30]]
## [[30]][[1]]
## [1] "MARINE HAIL"
##
##
## [[31]]
## [[31]][[1]]
## [1] "MARINE TSTM WIND" "MARINE HIGH WIND"
##
##
## [[32]]
## [[32]][[1]]
## [1] "MARINE STRONG WIND"
##
##
## [[33]]
## [[33]][[1]]
## [1] "MARINE THUNDERSTORM WIND"
##
##
## [[34]]
## [[34]][[1]]
## [1] "RIP CURRENT"
##
##
## [[35]]
## [[35]][[1]]
## [1] "HIGH" "ICE" "SEICHE"
##
##
## [[36]]
## [[36]][[1]]
## [1] "HEAT" "FLOOD" "FREEZE" "SLEET" "SNOW" "FROST" "ICE" "GLAZE"
## [9] "OTHER"
##
##
## [[37]]
## [[37]][[1]]
## [1] "STORM SURGE/TIDE"
##
##
## [[38]]
## [[38]][[1]]
## [1] "STRONG WIND"
##
##
## [[39]]
## [[39]][[1]]
## [1] "THUNDERSTORM WIND" "THUNDERSTORM WINDS 13" "THUNDERSTORM WIND G50"
## [4] "THUNDERSTORM WIND G55" "THUNDERSTORM WIND G60" "THUNDERSTORM HAIL"
## [7] "THUNDERSTORM WIND G52" "THUNDERSTORMW"
##
##
## [[40]]
## [[40]][[1]]
## [1] "TORNADO" "GUSTNADO" "TSUNAMI"
##
##
## [[41]]
## [[41]][[1]]
## [1] "TROPICAL DEPRESSION"
##
##
## [[42]]
## [[42]][[1]]
## [1] "TROPICAL STORM"
##
##
## [[43]]
## [[43]][[1]]
## [1] "TORNADO" "TSTMW" "TSUNAMI"
##
##
## [[44]]
## [[44]][[1]]
## [1] "VOLCANIC ASH"
##
##
## [[45]]
## [[45]][[1]]
## [1] "WATERSPOUT" "LANDSPOUT"
##
##
## [[46]]
## [[46]][[1]]
## [1] "WILDFIRE"
##
##
## [[47]]
## [[47]][[1]]
## [1] "WINTER STORM" "ICE STORM" "WIND STORM"
##
##
## [[48]]
## [[48]][[1]]
## [1] "WINTER WEATHER" "WINTER WEATHER MIX" "WINTER WEATHER/MIX"
Many of the matched strings actually belong to a different event type. So instead of applying the replace_match function, the next step replaces, one pattern at a time, the unique values that are mismatched or not written with the official name, for example changing “HEAVY SNOWPACK” to “HEAVY SNOW”. The data documentation is used to decide which of the 48 official types best fits each unique value in EVTYPE.
subset$EVTYPE[grep("BEACH", subset$EVTYPE)] <- "COASTAL FLOOD"
subset$EVTYPE[grep("EROSION|(COASTAL|TIDAL) FLOOD", subset$EVTYPE)] <- "COASTAL FLOOD"
subset$EVTYPE[grep("FLASH", subset$EVTYPE)] <- "FLASH FLOOD"
subset$EVTYPE[grep("^FLOOD|(BREAKUP|URBAN|STREAM|RURAL|ICE JAM|AND|RIVER|MINOR|MAJOR|\\/|SNOWMELT)(.)?FLOOD", subset$EVTYPE)] <- "FLOOD"
subset$EVTYPE[grep("(EXTREME|RECORD|BITTER) COLD|^EXTREME(.*)CHILL", subset$EVTYPE)] <- "EXTREME COLD/WIND CHILL"
subset$EVTYPE[grep("FUNNEL", subset$EVTYPE)] <- "FUNNEL CLOUD"
subset$EVTYPE[grep("FROST|FREEZE", subset$EVTYPE)] <- "FROST/FREEZE"
subset$EVTYPE[grep("^(SNOW)|(HEAVY|LIGHT|BLOWING|LAKE|RECORD|AND|SEASON|EXCESSIVE|FALLING) (SNOW)|THUNDERSNOW|SNOWSTORM|\\/SNOW$|SNOWFALL$", subset$EVTYPE)] <- "HEAVY SNOW"
subset$EVTYPE[grep("WATERSPOUT", subset$EVTYPE)] <- "WATERSPOUT"
subset$EVTYPE[grep("TORNADO", subset$EVTYPE)] <- "TORNADO"
subset$EVTYPE[grep("^(((FOG AND|UNSEASONABLY|UNSEASONABLE|EXTENDED) )?COLD)|COLD$", subset$EVTYPE)] <- "COLD/WIND CHILL"
subset$EVTYPE[grep("^FOG", subset$EVTYPE)] <- "DENSE FOG"
subset$EVTYPE <- gsub("TSTM", "THUNDERSTORM", subset$EVTYPE)
subset$EVTYPE[grep("^(((THUNDERSTORM|SMALL) )?HAIL)", subset$EVTYPE)] <- "HAIL"
subset$EVTYPE[grep("^(((SEVERE) )?( )?THUNDERSTORM)", subset$EVTYPE)] <- "THUNDERSTORM WIND"
subset$EVTYPE <- gsub("GROUND BLIZZARD", "BLIZZARD", subset$EVTYPE)
subset$EVTYPE[grep("^HIGH WIND", subset$EVTYPE)] <- "HIGH WIND"
subset$EVTYPE[grep("^(((GUSTY|STORM FORCE|ICE\\/STRONG|GRADIENT|NON-SEVERE|NON THUNDERSTORM|NON-THUNDERSTORM|HEAVY SURF AND) )?WIND)", subset$EVTYPE)] <- "STRONG WIND"
subset$EVTYPE[grep("WHIRLWIND", subset$EVTYPE)] <- "TORNADO"
subset$EVTYPE <- gsub("MICROBURST", "DOWNBURST", subset$EVTYPE)
subset$EVTYPE[grep("DOWNBURST", subset$EVTYPE)] <- "DOWNBURST"
subset$EVTYPE[grep("LIGHTNING",subset$EVTYPE)] <- "LIGHTNING"
subset$EVTYPE[grep("^(((RECORD|HEAVY|EXCESSIVE|TORRENTIAL|UNSEASONAL) )?RAIN)",subset$EVTYPE)] <- "HEAVY RAIN"
subset$EVTYPE[grep("SLEET",subset$EVTYPE)] <- "SLEET"
subset$EVTYPE[grep("FREEZING RAIN",subset$EVTYPE)] <- "FREEZING RAIN"
subset <- subset[which(!(subset$EVTYPE=="?")),]
subset$EVTYPE[grep("ICE",subset$EVTYPE)] <- "ICE STORM"
subset$EVTYPE[grep("RIP CURRENT",subset$EVTYPE)] <- "RIP CURRENT"
subset$EVTYPE[grep("SURF|ROUGH SEAS|HEAVY SEAS",subset$EVTYPE)] <- "HIGH SURF"
subset$EVTYPE[grep("WARM|^HEAT",subset$EVTYPE)] <- "HEAT"
subset$EVTYPE[grep("(EXCESSIVE|EXTREME|RECORD) HEAT",subset$EVTYPE)] <- "EXCESSIVE HEAT"
subset$EVTYPE <- gsub("PRECIPITATION|SHOWER", "RAIN", subset$EVTYPE)
subset$EVTYPE <- gsub("SWELLS", "SURF", subset$EVTYPE)
subset$EVTYPE[grep("HIGH TIDE|RISING WATER|HIGH WATER",subset$EVTYPE)] <- "STORM SURGE/TIDE"
subset$EVTYPE <- gsub("WAVES", "SURF", subset$EVTYPE)
subset$EVTYPE[grep("HURRICANE|TYPHOON",subset$EVTYPE)] <- "HURRICANE (TYPHOON)"
subset$EVTYPE <- gsub("COASTALSTORM", "COASTAL STORM", subset$EVTYPE)
subset$EVTYPE[grep("SURGE",subset$EVTYPE)] <- "STORM SURGE/TIDE"
subset$EVTYPE[grep("TROPICAL STORM",subset$EVTYPE)] <- "TROPICAL STORM"
subset$EVTYPE[grep("WINTER STORM",subset$EVTYPE)] <- "WINTER STORM"
subset$EVTYPE[grep("WINTER WEATHER",subset$EVTYPE)] <- "WINTER WEATHER"
subset$EVTYPE[grep("LANDSLIDE|LANDSLUMP",subset$EVTYPE)] <- "DEBRIS FLOW"
subset$EVTYPE <- gsub("LANDSPOUT", "TORNADO", subset$EVTYPE)
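The chunk above mixes two idioms that behave differently: assigning through grep() overwrites the whole value of every matching element, while gsub() only substitutes the matched substring and leaves the rest of the string in place. Order also matters, because later patterns see the output of earlier ones. A minimal sketch on a made-up vector (the strings in `ev` are illustrative, not rows from the dataset):

```r
# Toy vector; these strings are illustrative assumptions, not dataset rows.
ev <- c("TSTM WIND", "TORNADO F2", "COLD AIR TORNADO", "HEAVY SNOWPACK")

# gsub() rewrites only the matched substring and keeps the rest.
ev <- gsub("TSTM", "THUNDERSTORM", ev)

# grep() returns the indices of matching elements; the assignment
# replaces the entire string at those positions.
ev[grep("TORNADO", ev)] <- "TORNADO"
ev[grep("^(HEAVY) (SNOW)", ev)] <- "HEAVY SNOW"

# A later rule only sees what earlier rules left behind: the third value
# is already "TORNADO", so a rule anchored on "^COLD" no longer matches it.
ev[grep("^COLD", ev)] <- "COLD/WIND CHILL"

print(ev)
```

Swapping the TORNADO and COLD rules would classify the third value as COLD/WIND CHILL instead, which is why the sequence of statements is part of the cleaning logic.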
There are still many more unique values in EVTYPE than in the official list of events. The next code chunk lists the values that have not yet been matched and assigns each of them to one of the official types according to the data documentation and the description in the REMARKS variable.
length(unique(subset$EVTYPE))
## [1] 93
unique(subset$EVTYPE)[is.na(amatch(unique(subset$EVTYPE), evtype_doc, maxDist = 1))]
## [1] "MARINE MISHAP" "SEVERE TURBULENCE" "APACHE COUNTY"
## [4] "HIGH" "MUDSLIDES" "DRY MIRCOBURST WINDS"
## [7] "DOWNBURST" "COOL AND WET" "MUD SLIDE"
## [10] "MUD SLIDES" "EXCESSIVE WETNESS" "GUSTNADO"
## [13] "FREEZING RAIN" "GRASS FIRES" "GLAZE"
## [16] "DUST STORM/HIGH WINDS" "FOREST FIRES" "URBAN AND SMALL"
## [19] "MUDSLIDE" "ICY ROADS" "HEAVY MIX"
## [22] "LOW TEMPERATURE" "HYPOTHERMIA" "WILD/FOREST FIRE"
## [25] "FREEZING DRIZZLE" "WILD/FOREST FIRES" "URBAN/SMALL STREAM"
## [28] "BRUSH FIRE" "HEAVY SURF" "URBAN SMALL"
## [31] "OTHER" "URBAN/SML STREAM FLD" "MARINE ACCIDENT"
## [34] "COASTAL STORM" "WINTRY MIX" "MIXED PRECIP"
## [37] "FREEZING SPRAY" "HYPOTHERMIA/EXPOSURE" "MIXED RAIN"
## [40] "DAM BREAK" "HYPERTHERMIA/EXPOSURE" "ROCK SLIDE"
## [43] "ROGUE WAVE" "BLOWING DUST" "DROWNING"
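With maxDist = 1, a value counts as matched when it is within one edit of an official name, so simple plurals pass while unrelated labels come back as NA. The sketch below reproduces that idea with base R's adist() (Levenshtein distance) as a stand-in for amatch(), whose default distance in the stringdist package differs slightly. The vector `official` is a shortened, hypothetical stand-in for evtype_doc.

```r
# Hypothetical, shortened stand-in for the official list (evtype_doc).
official <- c("HEAVY SNOW", "HIGH WIND", "DEBRIS FLOW")
values   <- c("HEAVY SNOWS", "HIGH WINDS", "MUDSLIDE", "GLAZE")

# Edit distance between every value (rows) and every official name (columns).
d <- adist(values, official)

# A value counts as matched if its nearest official name is at most 1 edit away.
nearest   <- apply(d, 1, min)
unmatched <- values[nearest > 1]

print(unmatched)
```

Values such as "MUDSLIDE" are too far from any official name to be matched automatically, which is why they are reassigned by hand in the next chunk.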
subset$EVTYPE[grep("DOWNBURST|MIRCOBURST|GUSTNADO",subset$EVTYPE)] <- "THUNDERSTORM WIND"
subset$EVTYPE[grep("LOW TEMPERATURE|HYPOTHERMIA",subset$EVTYPE)] <- "COLD/WIND CHILL"
subset$EVTYPE[grep("SLIDE",subset$EVTYPE)] <- "DEBRIS FLOW"
subset$EVTYPE[grep("FIRE",subset$EVTYPE)] <- "WILDFIRE"
subset$EVTYPE <- gsub("FREEZING (RAIN|DRIZZLE|SPRAY)", "WINTER WEATHER", subset$EVTYPE)
subset$EVTYPE[grep("HYPERTHERMIA",subset$EVTYPE)] <- "EXCESSIVE HEAT"
subset$EVTYPE <- gsub("WINTRY MIX", "WINTER WEATHER", subset$EVTYPE)
subset$EVTYPE <- gsub("MIXED (RAIN|PRECIP)", "HEAVY RAIN", subset$EVTYPE)
subset$EVTYPE <- gsub("ROGUE WAVE", "HIGH SURF", subset$EVTYPE)
subset$EVTYPE <- gsub("BLOWING DUST", "DUST STORM", subset$EVTYPE)
subset$EVTYPE[subset$EVTYPE=="HIGH"] <- "HIGH WIND"
subset$EVTYPE[subset$EVTYPE=="COOL AND WET"] <- "COLD/WIND CHILL"
subset$EVTYPE[subset$EVTYPE=="DUST STORM/HIGH WINDS"] <- "DUST STORM"
subset$EVTYPE[subset$EVTYPE=="HEAVY MIX"] <- "WINTER WEATHER"
subset$EVTYPE[subset$EVTYPE=="GLAZE"] <- "WINTER WEATHER"
subset$EVTYPE[subset$EVTYPE=="URBAN AND SMALL"] <- "HEAVY RAIN"
subset$EVTYPE[subset$EVTYPE=="EXCESSIVE WETNESS"] <- "HEAVY RAIN"
subset$EVTYPE[subset$EVTYPE=="URBAN/SMALL STREAM"] <- "FLOOD"
subset$EVTYPE[grep("URBAN/SML STREAM FLD",subset$EVTYPE)] <- "FLOOD"
subset$EVTYPE[subset$EVTYPE=="DAM BREAK"] <- "HEAVY RAIN"
subset$EVTYPE[subset$EVTYPE=="APACHE COUNTY"] <- "THUNDERSTORM WIND"
subset$EVTYPE[subset$EVTYPE=="URBAN SMALL"] <- "FLOOD"
subset$EVTYPE[subset$EVTYPE=="MARINE ACCIDENT"] <- "HIGH SURF"
subset$EVTYPE[subset$EVTYPE=="DROWNING"] <- "HEAVY RAIN"
subset$EVTYPE[subset$EVTYPE=="ICY ROADS"] <- "WINTER WEATHER"
length(unique(subset$EVTYPE))
## [1] 53
Here the data is grouped by EVTYPE, and FATALITIES and INJURIES are summed for each group.
harmful_ev <- subset %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES, na.rm = TRUE), INJURIES = sum(INJURIES, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
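The message printed by summarise() is informational: with a single grouping variable the result is ungrouped either way. The following sketch runs the same aggregation on made-up rows, assuming dplyr 1.0.0 or later, where the `.groups` argument is available. The data frame `toy` and its values are illustrative only.

```r
library(dplyr)

# Illustrative rows, not taken from the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 2, 1, NA),
  INJURIES   = c(10, 5, 0, 4)
)

toy_sum <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    FATALITIES = sum(FATALITIES, na.rm = TRUE),  # missing values are skipped
    INJURIES   = sum(INJURIES, na.rm = TRUE),
    .groups    = "drop"                          # ungrouped result, no message
  )

print(toy_sum)
```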
Next, the quantiles for FATALITIES and INJURIES are computed, and the data set created previously is subsetted to include only the event types whose FATALITIES or INJURIES total is at or above the third quartile (75%). The resulting data frames are then plotted.
quantile(harmful_ev$FATALITIES)
## 0% 25% 50% 75% 100%
## 0 0 28 168 5659
fat_sub <- harmful_ev[which(harmful_ev$FATALITIES >= quantile(harmful_ev$FATALITIES, probs = 0.75)),]
g <- ggplot(fat_sub, aes(EVTYPE, FATALITIES))
g + geom_col() + theme(axis.text.x = element_text(angle = 90)) + labs(title = "Fatalities by Event Type") + ylab("Fatalities") + xlab("Type of Event")
quantile(harmful_ev$INJURIES)
## 0% 25% 50% 75% 100%
## 0 2 72 1166 91364
inj_sub <- harmful_ev[which(harmful_ev$INJURIES >= quantile(harmful_ev$INJURIES, probs = 0.75)),]
d <- ggplot(inj_sub, aes(EVTYPE, INJURIES))
d + geom_col() + theme(axis.text.x = element_text(angle = 90)) + labs(title = "Injuries by Event Type") + ylab("Injuries") + xlab("Type of Event")
We can see from the plots that, according to the storm data, the most harmful type of event for population health is the tornado: it causes the most fatalities and the most injuries.
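The third-quartile filter used for both plots keeps the event types whose total is at or above the 75th percentile, so roughly the top quarter of types reach the plot. A small sketch with five made-up totals shows which rows survive. The event names A to E are hypothetical, and quantile() is used with R's default interpolation (type 7).

```r
# Five illustrative totals, one per hypothetical event type.
totals <- data.frame(
  EVTYPE     = c("A", "B", "C", "D", "E"),
  FATALITIES = c(0, 0, 28, 168, 5659)
)

# Third quartile with R's default method (type 7).
q3 <- quantile(totals$FATALITIES, probs = 0.75)

# Keep only the rows at or above the third quartile.
top <- totals[totals$FATALITIES >= q3, ]

print(q3)
print(top)
```

Because the comparison uses `>=`, an event type sitting exactly on the threshold is kept.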
The letters and numbers in the variables PROPDMGEXP and CROPDMGEXP act as multipliers for the values in the variables PROPDMG and CROPDMG. Here the variable PROPDMGEXP is recoded into numeric multipliers according to the information in the data documentation.
subset$PROPDMGEXP <- toupper(subset$PROPDMGEXP)
unique(subset$PROPDMGEXP)
## [1] "K" "M" "" "B" "+" "0" "5" "6" "4" "H" "2" "7" "3" "-"
subset$PROPDMGEXP <- factor(subset$PROPDMGEXP, levels = unique(subset$PROPDMGEXP), labels = c(1e3,1e6,0,1e9,1,10,10,10,10,100,10,10,10,0))
The next code chunk does the same for the CROPDMGEXP variable.
subset$CROPDMGEXP <- toupper(subset$CROPDMGEXP)
unique(subset$CROPDMGEXP)
## [1] "" "M" "K" "B" "?" "0"
subset$CROPDMGEXP <- factor(subset$CROPDMGEXP, levels = unique(subset$CROPDMGEXP), labels = c(0,1e6,1e3,1e9,0,10))
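Pairing `levels = unique(...)` with a positional `labels` vector works only while the codes appear in exactly the order printed above. If the data or the earlier filtering changes, the multipliers would silently shift. A lookup keyed by code does not depend on that order. In this sketch the function name `exp_multiplier` is hypothetical, and the multipliers simply mirror the ones chosen in this report.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Multipliers mirror the report's choices: K, M, B, H as usual,
# any single digit -> 10, "+" -> 1, everything else ("", "-", "?") -> 0.
exp_multiplier <- function(code) {
  code   <- toupper(code)
  lookup <- c(K = 1e3, M = 1e6, B = 1e9, H = 100)
  out    <- unname(lookup[code])          # NA for codes not in the lookup
  out[grepl("^[0-9]$", code)] <- 10
  out[code %in% "+"]          <- 1
  out[is.na(out)]             <- 0
  out
}

print(exp_multiplier(c("k", "M", "", "B", "+", "0", "5", "H", "-", "?")))
```

With such a helper the damage could be computed as `subset$PROPDMG * exp_multiplier(subset$PROPDMGEXP)`, without the round trip through a factor.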
Here we compute a new variable DAMAGE that aggregates property damage and crop damage.
subset$DAMAGE <- (subset$PROPDMG*(as.numeric(as.character(subset$PROPDMGEXP)))) + (subset$CROPDMG*(as.numeric(as.character(subset$CROPDMGEXP))))
Next we can see the six records with the highest values of this new variable, arranged in descending order.
head(arrange(subset, desc(DAMAGE)))[c(1:7,9:10)]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 FLOOD 0 0 115.00 1e+09 32.5 1e+06
## 2 STORM SURGE/TIDE 0 0 31.30 1e+09 0.0 0
## 3 HURRICANE (TYPHOON) 0 0 16.93 1e+09 0.0 0
## 4 STORM SURGE/TIDE 0 0 11.26 1e+09 0.0 0
## 5 FLOOD 0 0 5.00 1e+09 5.0 1e+09
## 6 HURRICANE (TYPHOON) 5 0 10.00 1e+09 0.0 0
## REFNUM DAMAGE
## 1 605943 115032500000
## 2 577616 31300000000
## 3 577615 16930000000
## 4 581535 11260000000
## 5 198375 10000000000
## 6 569288 10000000000
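The DAMAGE figures can be checked by hand from the columns shown: for REFNUM 605943, 115 × 1e9 plus 32.5 × 1e6 gives 115,032,500,000. The sketch below repeats that arithmetic and shows why the conversion goes through as.character(): calling as.numeric() directly on a factor returns its internal level codes rather than the labels.

```r
# Values copied from the first row of the table above.
propdmg <- 115;  prop_mult <- 1e9
cropdmg <- 32.5; crop_mult <- 1e6
damage  <- propdmg * prop_mult + cropdmg * crop_mult
print(format(damage, big.mark = ",", scientific = FALSE))

# Factor pitfall: labels are stored as text, levels as integer codes.
f <- factor(c("K", "M", "K"), levels = c("K", "M"), labels = c(1e3, 1e6))
print(as.numeric(f))                # integer level codes, not multipliers
print(as.numeric(as.character(f)))  # the multipliers themselves
```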
The first event appears to be an outlier: its DAMAGE and PROPDMG values are very high and are not consistent with the remarks, which mention damage in the tens of millions of dollars rather than the $115 billion recorded. Here we subset the dataset again to remove this event.
arrange(subset, desc(DAMAGE))[1,]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 FLOOD 0 0 115 1e+09 32.5 1e+06
## REMARKS
## 1 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## REFNUM DAMAGE
## 1 605943 115032500000
subset <- subset[which(subset$REFNUM != 605943),]
Here the data set is grouped by EVTYPE and the variable DAMAGE is aggregated for each group.
damage_ev <- subset %>% group_by(EVTYPE) %>% summarise(DAMAGE = sum(DAMAGE, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
head(arrange(damage_ev, desc(DAMAGE)))
## # A tibble: 6 x 2
## EVTYPE DAMAGE
## <chr> <dbl>
## 1 HURRICANE (TYPHOON) 90872527810
## 2 TORNADO 58959416607
## 3 STORM SURGE/TIDE 47976010500
## 4 FLOOD 46046052482
## 5 HAIL 19021485177
## 6 FLASH FLOOD 18440128353
Next, the third quartile is used to keep only the event types with the highest DAMAGE. The resulting data frame is then plotted to visualize the most damaging events.
quantile(damage_ev$DAMAGE)
## 0% 25% 50% 75% 100%
## 0 980000 101545500 4175838990 90872527810
dmg_sub <- damage_ev[which(damage_ev$DAMAGE >= quantile(damage_ev$DAMAGE, probs = 0.75)),]
f <- ggplot(dmg_sub, aes(EVTYPE, DAMAGE))
f + geom_col() + theme(axis.text.x = element_text(angle = 90)) + labs(title = "Type of Events with Highest Economic Impact") + ylab("Economic impact($)") + xlab("Type of Event")
We can see that the type of event with the greatest economic consequences in the USA is the hurricane.
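By default the bars are drawn in alphabetical order of EVTYPE, which makes the ranking harder to read. One option, offered as a sketch rather than the plot used in this report, is to order the bars by value with reorder() and to express damage in billions of dollars. The data frame `toy_dmg` reuses the four largest totals printed in the damage_ev output.

```r
library(ggplot2)

# Four largest totals copied from the damage_ev output in this report.
toy_dmg <- data.frame(
  EVTYPE = c("HURRICANE (TYPHOON)", "TORNADO", "STORM SURGE/TIDE", "FLOOD"),
  DAMAGE = c(90872527810, 58959416607, 47976010500, 46046052482)
)

# reorder() sorts the x axis by descending DAMAGE; dividing by 1e9
# puts the y axis in billions of dollars.
p <- ggplot(toy_dmg, aes(reorder(EVTYPE, -DAMAGE), DAMAGE / 1e9)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Type of Events with Highest Economic Impact",
       x = "Type of Event", y = "Economic impact (billions of $)")

print(p)
```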