The original data was loaded, then processed and cleaned as described in the Data Processing section of this document. Two barplots were then constructed to show the five event types that caused the most health damage and the five that caused the most economic damage. We hope these conclusions can inform better prevention measures that minimize the impact of such weather events on human life and the economy.
We will start by loading the dplyr package which will be very helpful in the processing of this data.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next, we check whether the data file is already in the working directory and, if it is not, download it from the original website. We then load it, unless an object named data already exists in the workspace.
if (!"StormData.csv.bz2" %in% dir(".")) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(url, "StormData.csv.bz2")
}
if (!"data" %in% ls()) {
    data <- read.csv("StormData.csv.bz2")
}
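The structure output in this report shows text columns stored as factors, which was the default behaviour of read.csv before R 4.0.0; from 4.0.0 onwards the default is stringsAsFactors = FALSE. A minimal sketch of the difference, using a made-up two-row CSV (the toy_csv string and the object names are hypothetical, not part of the original analysis):

```r
# Hypothetical two-row CSV standing in for the storm file.
toy_csv <- "EVTYPE,FATALITIES\nTORNADO,0\nHAIL,1"

# Default in R >= 4.0.0: text columns are read as character vectors.
as_chr <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# Explicitly requesting factors reproduces the older behaviour.
as_fct <- read.csv(text = toy_csv, stringsAsFactors = TRUE)

class(as_chr$EVTYPE)
class(as_fct$EVTYPE)
```

Passing stringsAsFactors = TRUE when reading the storm file would be one way to obtain factor columns like those printed in this report on a recent R version.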
Let’s start by checking the structure of this data:
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
As we can see, it is quite a complex dataset. Let us simplify it a little before going into deeper cleaning and processing. Our objective is to analyse the impact of weather event types on human health and the economy, so, going by the National Weather Service Storm Data Documentation (available at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf), not all variables are of interest to us.
For our objective, the following variables suffice: EVTYPE (the event type), FATALITIES and INJURIES (the main source of information about health damage), PROPDMG and CROPDMG (the main source of information about economic damage, as property damage and crop damage) and PROPDMGEXP and CROPDMGEXP (which tell us whether a damage value is in thousands, millions, billions, etc.). It therefore makes sense to keep only these variables. We will also need the dates, for reasons that will be explained soon.
data = select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)
head(data)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 0
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 0
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 0
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 0
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 0
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 0
## PROPDMGEXP CROPDMGEXP
## 1 K
## 2 K
## 3 K
## 4 K
## 5 K
## 6 K
From the information that was given, we know that the events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records and thus more recent years should be considered more complete.
With this in mind, it is important to know which years have the most records, since those years should give us better quality data. Restricting the analysis to them also lessens the complexity of the dataset.
To do that, we first need to extract the year from each date (which requires some processing of the dates) and then build a histogram of the number of events per year.
data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y %H:%M:%S")
data$BGN_DATE <- as.numeric(format(data$BGN_DATE, "%Y"))
head(data)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP
## 1 1950 TORNADO 0 15 25.0 0 K
## 2 1950 TORNADO 0 0 2.5 0 K
## 3 1951 TORNADO 0 2 25.0 0 K
## 4 1951 TORNADO 0 2 2.5 0 K
## 5 1951 TORNADO 0 2 2.5 0 K
## 6 1951 TORNADO 0 6 2.5 0 K
## CROPDMGEXP
## 1
## 2
## 3
## 4
## 5
## 6
Everything went well. Let us build the histogram next.
hist(data$BGN_DATE, main="Number of Events per Year", xlab="Year", ylab="Frequency of Events", col="red")
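As a complement to the histogram, exact per-year counts can be obtained with table(). The sketch below uses a small made-up vector of years (the name years and its values are hypothetical) rather than the storm data:

```r
# Hypothetical years standing in for data$BGN_DATE after conversion.
years <- c(1950, 1950, 1951, 1990, 1990, 1990, 2011)

# table() counts one bin per distinct year, whereas hist() chooses
# its own break points and may group several years into one bar.
events_per_year <- table(years)
events_per_year

# Share of records from 1990 onwards.
mean(years >= 1990)
```

Applied to data$BGN_DATE, the same two calls would show how many records each year contributes and what fraction of the file a given cut-off year keeps.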
Judging from the histogram, a good starting point is to keep only the records from 1990 through 2011.
data <- filter(data, BGN_DATE >= 1990)
head(data)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP
## 1 1990 HAIL 0 0 0.0 0
## 2 1990 TSTM WIND 0 0 0.0 0
## 3 1990 TSTM WIND 0 0 0.0 0
## 4 1990 TSTM WIND 0 0 0.0 0
## 5 1990 TORNADO 0 28 2.5 0 M
## 6 1990 TSTM WIND 0 0 0.0 0
## CROPDMGEXP
## 1
## 2
## 3
## 4
## 5
## 6
To lessen the complexity of this dataset even more, let us keep only the rows where at least one of FATALITIES, INJURIES, PROPDMG or CROPDMG is greater than 0, as those are the ones we are interested in.
data <- filter(data, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
head(data)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP
## 1 1990 TORNADO 0 28 2.5 0 M
## 2 1990 TORNADO 0 0 25.0 0 K
## 3 1990 TORNADO 0 0 25.0 0 K
## 4 1990 TORNADO 0 3 2.5 0 M
## 5 1990 TORNADO 0 2 2.5 0 M
## 6 1990 TORNADO 0 15 2.5 0 M
## CROPDMGEXP
## 1
## 2
## 3
## 4
## 5
## 6
As we are interested in health damage and economic damage, we should add fatalities and injuries together (new variable: HEALTHDMG), and likewise add property damage and crop damage together (new variable: ECONDMG).
data <- mutate(data, HEALTHDMG = FATALITIES + INJURIES)
This takes care of the variable HEALTHDMG. ECONDMG needs more processing, because the sum of PROPDMG and CROPDMG has to take into account the values of PROPDMGEXP and CROPDMGEXP. According to the documentation (linked earlier in this report), these EXP codes mean the following: “K” is thousands, “M” is millions and “B” is billions. Using the ifelse function, we build a factor for each damage column that equals the multiplier given by its EXP code, or 1 (having no effect) when the code is not one of these three, and multiply the damage value by it:
PROPFACTOR <- ifelse(data$PROPDMGEXP == "K", 1000,
              ifelse(data$PROPDMGEXP == "M", 1000000,
              ifelse(data$PROPDMGEXP == "B", 1000000000, 1)))
CROPFACTOR <- ifelse(data$CROPDMGEXP == "K", 1000,
              ifelse(data$CROPDMGEXP == "M", 1000000,
              ifelse(data$CROPDMGEXP == "B", 1000000000, 1)))
PROPDMG <- data$PROPDMG * PROPFACTOR
CROPDMG <- data$CROPDMG * CROPFACTOR
data$ECONDMG <- PROPDMG + CROPDMG
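The nested ifelse calls can also be expressed as a lookup table. The following is only a sketch on toy values: the helper name exp_to_factor is hypothetical, and treating lower-case codes such as "k" the same as "K" is an assumption of this sketch, whereas the code above matches upper-case codes only.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Codes are upper-cased first so that "k" and "K" are treated alike
# (an assumption of this sketch); anything not in the table
# (blank, "?", digits, ...) falls back to 1.
exp_to_factor <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  factor_value <- lookup[toupper(as.character(code))]
  factor_value[is.na(factor_value)] <- 1
  unname(factor_value)
}

# Toy values, not taken from the storm data.
toy <- data.frame(PROPDMG = c(2.5, 25, 1, 3),
                  PROPDMGEXP = c("M", "K", "b", ""))
toy$PROPDMG * exp_to_factor(toy$PROPDMGEXP)
```

A lookup vector keeps the code-to-multiplier mapping in one place, which makes it easier to extend if other codes turn out to matter.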
Taking another look at the data, everything seems to be alright.
head(data)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP
## 1 1990 TORNADO 0 28 2.5 0 M
## 2 1990 TORNADO 0 0 25.0 0 K
## 3 1990 TORNADO 0 0 25.0 0 K
## 4 1990 TORNADO 0 3 2.5 0 M
## 5 1990 TORNADO 0 2 2.5 0 M
## 6 1990 TORNADO 0 15 2.5 0 M
## CROPDMGEXP HEALTHDMG ECONDMG
## 1 28 2500000
## 2 0 25000
## 3 0 25000
## 4 3 2500000
## 5 2 2500000
## 6 15 2500000
However, if we look at the different values of EVTYPE that exist in this dataset:
unique_evtype <- unique(data$EVTYPE)
str(unique_evtype)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 856 244 429 972 410 786 406 409 290 ...
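We see it is a factor with 985 levels. That count covers every level in the original file, because factor levels are kept after filtering; the listing below shows 488 distinct values actually remaining. Either number is strange considering that, according to the documentation, there are only 48 categories (table 2.1.1, page 6).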
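The gap between the number of levels and the number of values present comes from how R factors behave under subsetting. A minimal sketch on a toy factor (the names f and kept are hypothetical):

```r
# Toy factor, not the storm data.
f <- factor(c("TORNADO", "HAIL", "FLOOD", "HAIL"))

# Subsetting keeps every original level, even those no longer present.
kept <- f[f != "FLOOD"]
nlevels(kept)

# Distinct values actually present, and the level count once unused
# levels are dropped.
length(unique(kept))
nlevels(droplevels(kept))
```

Calling droplevels() on EVTYPE after filtering would make the reported level count match the values that remain.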
unique_evtype
## [1] TORNADO TSTM WIND
## [3] HAIL ICE STORM/FLASH FLOOD
## [5] WINTER STORM HURRICANE OPAL/HIGH WINDS
## [7] THUNDERSTORM WINDS HURRICANE ERIN
## [9] HURRICANE OPAL HEAVY RAIN
## [11] LIGHTNING THUNDERSTORM WIND
## [13] DENSE FOG RIP CURRENT
## [15] THUNDERSTORM WINS FLASH FLOODING
## [17] FLASH FLOOD TORNADO F0
## [19] THUNDERSTORM WINDS LIGHTNING THUNDERSTORM WINDS/HAIL
## [21] HEAT HIGH WINDS
## [23] WIND HEAVY RAINS
## [25] LIGHTNING AND HEAVY RAIN THUNDERSTORM WINDS HAIL
## [27] COLD HEAVY RAIN/LIGHTNING
## [29] FLASH FLOODING/THUNDERSTORM WI FLOODING
## [31] WATERSPOUT EXTREME COLD
## [33] LIGHTNING/HEAVY RAIN BREAKUP FLOODING
## [35] HIGH WIND FREEZE
## [37] RIVER FLOOD HIGH WINDS HEAVY RAINS
## [39] AVALANCHE MARINE MISHAP
## [41] HIGH TIDES HIGH WIND/SEAS
## [43] HIGH WINDS/HEAVY RAIN HIGH SEAS
## [45] COASTAL FLOOD SEVERE TURBULENCE
## [47] RECORD RAINFALL HEAVY SNOW
## [49] HEAVY SNOW/WIND DUST STORM
## [51] FLOOD APACHE COUNTY
## [53] SLEET DUST DEVIL
## [55] ICE STORM EXCESSIVE HEAT
## [57] THUNDERSTORM WINDS/FUNNEL CLOU GUSTY WINDS
## [59] FLOODING/HEAVY RAIN HEAVY SURF COASTAL FLOODING
## [61] HIGH SURF WILD FIRES
## [63] HIGH WINTER STORM HIGH WINDS
## [65] WINTER STORMS MUDSLIDES
## [67] RAINSTORM SEVERE THUNDERSTORM
## [69] SEVERE THUNDERSTORMS SEVERE THUNDERSTORM WINDS
## [71] THUNDERSTORMS WINDS FLOOD/FLASH FLOOD
## [73] FLOOD/RAIN/WINDS THUNDERSTORMS
## [75] FLASH FLOOD WINDS WINDS
## [77] FUNNEL CLOUD HIGH WIND DAMAGE
## [79] STRONG WIND HEAVY SNOWPACK
## [81] FLASH FLOOD/ HEAVY SURF
## [83] DRY MIRCOBURST WINDS DRY MICROBURST
## [85] URBAN FLOOD THUNDERSTORM WINDSS
## [87] MICROBURST WINDS HEAT WAVE
## [89] UNSEASONABLY WARM COASTAL FLOODING
## [91] STRONG WINDS BLIZZARD
## [93] WATERSPOUT/TORNADO WATERSPOUT TORNADO
## [95] STORM SURGE URBAN/SMALL STREAM FLOOD
## [97] WATERSPOUT- TORNADOES, TSTM WIND, HAIL
## [99] TROPICAL STORM ALBERTO TROPICAL STORM
## [101] TROPICAL STORM GORDON TROPICAL STORM JERRY
## [103] LIGHTNING THUNDERSTORM WINDS URBAN FLOODING
## [105] MINOR FLOODING WATERSPOUT-TORNADO
## [107] LIGHTNING INJURY LIGHTNING AND THUNDERSTORM WIN
## [109] FLASH FLOODS THUNDERSTORM WINDS53
## [111] WILDFIRE DAMAGING FREEZE
## [113] THUNDERSTORM WINDS 13 HURRICANE
## [115] SNOW LIGNTNING
## [117] FROST FREEZING RAIN/SNOW
## [119] HIGH WINDS/ THUNDERSNOW
## [121] FLOODS COOL AND WET
## [123] HEAVY RAIN/SNOW GLAZE ICE
## [125] MUD SLIDE HIGH WINDS
## [127] RURAL FLOOD MUD SLIDES
## [129] EXTREME HEAT DROUGHT
## [131] COLD AND WET CONDITIONS EXCESSIVE WETNESS
## [133] SLEET/ICE STORM GUSTNADO
## [135] FREEZING RAIN SNOW AND HEAVY SNOW
## [137] GROUND BLIZZARD EXTREME WIND CHILL
## [139] MAJOR FLOOD SNOW/HEAVY SNOW
## [141] FREEZING RAIN/SLEET ICE JAM FLOODING
## [143] COLD AIR TORNADO WIND DAMAGE
## [145] FOG TSTM WIND 55
## [147] SMALL STREAM FLOOD THUNDERTORM WINDS
## [149] HAIL/WINDS SNOW AND ICE
## [151] WIND STORM GRASS FIRES
## [153] LAKE FLOOD HAIL/WIND
## [155] WIND/HAIL ICE
## [157] SNOW AND ICE STORM THUNDERSTORM WINDS
## [159] WINTER WEATHER DROUGHT/EXCESSIVE HEAT
## [161] THUNDERSTORMS WIND TUNDERSTORM WIND
## [163] URBAN AND SMALL STREAM FLOODIN THUNDERSTORM WIND/LIGHTNING
## [165] HEAVY RAIN/SEVERE WEATHER THUNDERSTORM
## [167] WATERSPOUT/ TORNADO LIGHTNING.
## [169] HURRICANE-GENERATED SWELLS RIVER AND STREAM FLOOD
## [171] HIGH WINDS/COASTAL FLOOD RAIN
## [173] RIVER FLOODING ICE FLOES
## [175] THUNDERSTORM WIND G50 LIGHTNING FIRE
## [177] HEAVY LAKE SNOW RECORD COLD
## [179] HEAVY SNOW/FREEZING RAIN COLD WAVE
## [181] DUST DEVIL WATERSPOUT TORNADO F3
## [183] TORNDAO FLOOD/RIVER FLOOD
## [185] MUD SLIDES URBAN FLOODING TORNADO F1
## [187] GLAZE/ICE STORM GLAZE
## [189] HEAVY SNOW/WINTER STORM MICROBURST
## [191] AVALANCE BLIZZARD/WINTER STORM
## [193] DUST STORM/HIGH WINDS ICE JAM
## [195] FOREST FIRES FROST\\FREEZE
## [197] THUNDERSTORM WINDS. HVY RAIN
## [199] HAIL 150 HAIL 075
## [201] HAIL 100 THUNDERSTORM WIND G55
## [203] HAIL 125 THUNDERSTORM WIND G60
## [205] THUNDERSTORM WINDS G60 HARD FREEZE
## [207] HAIL 200 HEAVY SNOW AND HIGH WINDS
## [209] HEAVY SNOW/HIGH WINDS & FLOOD HEAVY RAIN AND FLOOD
## [211] RIP CURRENTS/HEAVY SURF URBAN AND SMALL
## [213] WILDFIRES FOG AND COLD TEMPERATURES
## [215] SNOW/COLD FLASH FLOOD FROM ICE JAMS
## [217] TSTM WIND G58 MUDSLIDE
## [219] HEAVY SNOW SQUALLS SNOW SQUALL
## [221] SNOW/ICE STORM HEAVY SNOW/SQUALLS
## [223] HEAVY SNOW-SQUALLS ICY ROADS
## [225] HEAVY MIX SNOW FREEZING RAIN
## [227] SNOW/SLEET SNOW/FREEZING RAIN
## [229] SNOW SQUALLS SNOW/SLEET/FREEZING RAIN
## [231] RECORD SNOW HAIL 0.75
## [233] RECORD HEAT THUNDERSTORM WIND 65MPH
## [235] THUNDERSTORM WIND/ TREES THUNDERSTORM WIND/AWNING
## [237] THUNDERSTORM WIND 98 MPH THUNDERSTORM WIND TREES
## [239] TORNADO F2 RIP CURRENTS
## [241] HURRICANE EMILY COASTAL SURGE
## [243] HURRICANE GORDON HURRICANE FELIX
## [245] THUNDERSTORM WIND 60 MPH THUNDERSTORM WINDS 63 MPH
## [247] THUNDERSTORM WIND/ TREE THUNDERSTORM DAMAGE TO
## [249] THUNDERSTORM WIND 65 MPH FLASH FLOOD - HEAVY RAIN
## [251] THUNDERSTORM WIND. FLASH FLOOD/ STREET
## [253] BLOWING SNOW HEAVY SNOW/BLIZZARD
## [255] THUNDERSTORM HAIL THUNDERSTORM WINDSHAIL
## [257] LIGHTNING WAUSEON THUDERSTORM WINDS
## [259] ICE AND SNOW STORM FORCE WINDS
## [261] HEAVY SNOW/ICE LIGHTING
## [263] HIGH WIND/HEAVY SNOW THUNDERSTORM WINDS AND
## [265] HEAVY PRECIPITATION HIGH WIND/BLIZZARD
## [267] TSTM WIND DAMAGE FLOOD FLASH
## [269] RAIN/WIND SNOW/ICE
## [271] HAIL 75 HEAT WAVE DROUGHT
## [273] HEAVY SNOW/BLIZZARD/AVALANCHE HEAT WAVES
## [275] UNSEASONABLY WARM AND DRY UNSEASONABLY COLD
## [277] RECORD/EXCESSIVE HEAT THUNDERSTORM WIND G52
## [279] HIGH WAVES FLASH FLOOD/FLOOD
## [281] FLOOD/FLASH LOW TEMPERATURE
## [283] HEAVY RAINS/FLOODING THUNDERESTORM WINDS
## [285] THUNDERSTORM WINDS/FLOODING HYPOTHERMIA
## [287] THUNDEERSTORM WINDS THUNERSTORM WINDS
## [289] HIGH WINDS/COLD COLD/WINDS
## [291] SNOW/ BITTER COLD COLD WEATHER
## [293] RAPIDLY RISING WATER WILD/FOREST FIRE
## [295] ICE/STRONG WINDS SNOW/HIGH WINDS
## [297] HIGH WINDS/SNOW SNOWMELT FLOODING
## [299] HEAVY SNOW AND STRONG WINDS SNOW ACCUMULATION
## [301] SNOW/ ICE SNOW/BLOWING SNOW
## [303] TORNADOES THUNDERSTORM WIND/HAIL
## [305] FREEZING DRIZZLE HAIL 175
## [307] FLASH FLOODING/FLOOD HAIL 275
## [309] HAIL 450 EXCESSIVE RAINFALL
## [311] THUNDERSTORMW HAILSTORM
## [313] TSTM WINDS TSTMW
## [315] TSTM WIND 65) TROPICAL STORM DEAN
## [317] THUNDERSTORM WINDS/ FLOOD LANDSLIDE
## [319] HIGH WIND AND SEAS THUNDERSTORMWINDS
## [321] WILD/FOREST FIRES HEAVY SEAS
## [323] HAIL DAMAGE FLOOD & HEAVY RAIN
## [325] ? THUNDERSTROM WIND
## [327] FLOOD/FLASHFLOOD HIGH WATER
## [329] HIGH WIND 48 LANDSLIDES
## [331] URBAN/SMALL STREAM BRUSH FIRE
## [333] HEAVY SHOWER HEAVY SWELLS
## [335] URBAN SMALL URBAN FLOODS
## [337] FLASH FLOOD/LANDSLIDE HEAVY RAIN/SMALL STREAM URBAN
## [339] FLASH FLOOD LANDSLIDES TSTM WIND/HAIL
## [341] Other Ice jam flood (minor
## [343] Tstm Wind URBAN/SML STREAM FLD
## [345] ROUGH SURF Heavy Surf
## [347] Dust Devil Marine Accident
## [349] Freeze Strong Wind
## [351] COASTAL STORM Erosion/Cstl Flood
## [353] River Flooding Damaging Freeze
## [355] Beach Erosion High Surf
## [357] Heavy Rain/High Surf Unseasonable Cold
## [359] Early Frost Wintry Mix
## [361] Extreme Cold Coastal Flooding
## [363] Torrential Rainfall Landslump
## [365] Hurricane Edouard Coastal Storm
## [367] TIDAL FLOODING Tidal Flooding
## [369] Strong Winds EXTREME WINDCHILL
## [371] Glaze Extended Cold
## [373] Whirlwind Heavy snow shower
## [375] Light snow Light Snow
## [377] MIXED PRECIP Freezing Spray
## [379] DOWNBURST Mudslides
## [381] Microburst Mudslide
## [383] Cold Coastal Flood
## [385] Snow Squalls Wind Damage
## [387] Light Snowfall Freezing Drizzle
## [389] Gusty wind/rain GUSTY WIND/HVY RAIN
## [391] Wind Cold Temperature
## [393] Heat Wave Snow
## [395] COLD AND SNOW RAIN/SNOW
## [397] TSTM WIND (G45) Gusty Winds
## [399] GUSTY WIND TSTM WIND 40
## [401] TSTM WIND 45 TSTM WIND (41)
## [403] TSTM WIND (G40) Frost/Freeze
## [405] AGRICULTURAL FREEZE OTHER
## [407] Hypothermia/Exposure HYPOTHERMIA/EXPOSURE
## [409] Lake Effect Snow Freezing Rain
## [411] Mixed Precipitation BLACK ICE
## [413] COASTALSTORM LIGHT SNOW
## [415] DAM BREAK Gusty winds
## [417] blowing snow GRADIENT WIND
## [419] TSTM WIND AND LIGHTNING gradient wind
## [421] Gradient wind Freezing drizzle
## [423] WET MICROBURST Heavy surf and wind
## [425] TYPHOON HIGH SWELLS
## [427] SMALL HAIL UNSEASONAL RAIN
## [429] COASTAL FLOODING/EROSION TSTM WIND (G45)
## [431] TSTM WIND (G45) HIGH WIND (G40)
## [433] TSTM WIND (G35) COASTAL EROSION
## [435] SEICHE COASTAL FLOODING/EROSION
## [437] HYPERTHERMIA/EXPOSURE WINTRY MIX
## [439] ROCK SLIDE GUSTY WIND/HAIL
## [441] TSTM WIND LANDSPOUT
## [443] EXCESSIVE SNOW LAKE EFFECT SNOW
## [445] FLOOD/FLASH/FLOOD MIXED PRECIPITATION
## [447] WIND AND WAVE LIGHT FREEZING RAIN
## [449] ICE ROADS ROUGH SEAS
## [451] TSTM WIND G45 NON-SEVERE WIND DAMAGE
## [453] WARM WEATHER THUNDERSTORM WIND (G40)
## [455] FLASH FLOOD LATE SEASON SNOW
## [457] WINTER WEATHER MIX ROGUE WAVE
## [459] FALLING SNOW/ICE NON-TSTM WIND
## [461] NON TSTM WIND BLOWING DUST
## [463] VOLCANIC ASH HIGH SURF ADVISORY
## [465] HAZARDOUS SURF WHIRLWIND
## [467] ICE ON ROAD DROWNING
## [469] EXTREME COLD/WIND CHILL MARINE TSTM WIND
## [471] HURRICANE/TYPHOON WINTER WEATHER/MIX
## [473] FROST/FREEZE ASTRONOMICAL HIGH TIDE
## [475] HEAVY SURF/HIGH SURF TROPICAL DEPRESSION
## [477] LAKE-EFFECT SNOW MARINE HIGH WIND
## [479] TSUNAMI STORM SURGE/TIDE
## [481] COLD/WIND CHILL LAKESHORE FLOOD
## [483] MARINE THUNDERSTORM WIND MARINE STRONG WIND
## [485] ASTRONOMICAL LOW TIDE DENSE SMOKE
## [487] MARINE HAIL FREEZING FOG
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
By checking this, we see that many categories are misspelled or are new categories that do not appear in the documentation. A full-blown cleaning of this dataset would need a specialist to deal with all the metadata. We will now use regular expressions to merge most of the scattered categories into the categories in the documentation.
data$EVTYPE[grep("^Aval", data$EVTYPE, ignore.case=TRUE)] <- "AVALANCHE"
data$EVTYPE[grep("^Blizz", data$EVTYPE, ignore.case=TRUE)] <- "BLIZZARD"
data$EVTYPE[grep("^Astronomic", data$EVTYPE, ignore.case=TRUE)] <- "ASTRONOMICAL LOW TIDE"
data$EVTYPE[grep("^Cold", data$EVTYPE, ignore.case=TRUE)] <- "COLD/WIND CHILL"
data$EVTYPE[grep("^Dry", data$EVTYPE, ignore.case=TRUE)] <- "HEAT"
data$EVTYPE[grep("Excessive Heat", data$EVTYPE, ignore.case=TRUE)] <- "EXCESSIVE HEAT"
data$EVTYPE[grep("Fire", data$EVTYPE, ignore.case=TRUE)] <- "WILDFIRE"
data$EVTYPE[grep("Marine", data$EVTYPE, ignore.case=TRUE)] <- "MARINE THUNDERSTORM WIND"
data$EVTYPE[grep("^Flood", data$EVTYPE, ignore.case=TRUE)] <- "FLOOD"
data$EVTYPE[grep("Fog", data$EVTYPE, ignore.case=TRUE)] <- "DENSE FOG"
data$EVTYPE[grep("Hurricane", data$EVTYPE, ignore.case=TRUE)] <- "HURRICANE"
data$EVTYPE[grep("Typh", data$EVTYPE, ignore.case=TRUE)] <- "HURRICANE"
data$EVTYPE[grep("Winter", data$EVTYPE, ignore.case=TRUE)] <- "WINTER STORM"
data$EVTYPE[grep("Torn", data$EVTYPE, ignore.case=TRUE)] <- "TORNADO"
data$EVTYPE[grep("^RIP", data$EVTYPE, ignore.case=TRUE)] <- "RIP CURRENT"
data$EVTYPE[grep("^TSTM", data$EVTYPE, ignore.case=TRUE)] <- "THUNDERSTORM WIND"
data$EVTYPE[grep("^Snow", data$EVTYPE, ignore.case=TRUE)] <- "HEAVY SNOW"
data$EVTYPE[grep("Snow$", data$EVTYPE, ignore.case=TRUE)] <- "LAKE-EFFECT SNOW"
data$EVTYPE[grep("Storm surge", data$EVTYPE, ignore.case=TRUE)] <- "STORM SURGE/TIDE"
data$EVTYPE[grep("^Ice", data$EVTYPE, ignore.case=TRUE)] <- "ICE STORM"
data$EVTYPE[grep("^Water", data$EVTYPE, ignore.case=TRUE)] <- "WATERSPOUT"
data$EVTYPE[grep("rain", data$EVTYPE, ignore.case=TRUE)] <- "HEAVY RAIN"
data$EVTYPE[grep("Mud", data$EVTYPE, ignore.case=TRUE)] <- "AVALANCHE"
data$EVTYPE[grep("FLD", data$EVTYPE, ignore.case=TRUE)] <- "FLOOD"
data$EVTYPE[grep("High Wind", data$EVTYPE, ignore.case=TRUE)] <- "HIGH WIND"
data$EVTYPE[grep("^HEAVY SNOW", data$EVTYPE, ignore.case=TRUE)] <- "HEAVY SNOW"
data$EVTYPE[intersect(grep("THUNDERSTORM WIND", data$EVTYPE, ignore.case=TRUE), grep("MARINE", data$EVTYPE, ignore.case=TRUE, invert=TRUE))] <- "THUNDERSTORM WIND"
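One property worth keeping in mind when reading the replacements above is that they run sequentially, so a later pattern can re-label values produced by an earlier one. The sketch below illustrates this on a toy character vector (x is hypothetical, not the storm data), using two patterns of the same shape as those above:

```r
# Toy character vector, not the storm data.
x <- c("SNOW", "HEAVY SNOW", "Light Snow", "HAIL")

# Rule 1: labels starting with "snow" become HEAVY SNOW.
x[grep("^Snow", x, ignore.case = TRUE)] <- "HEAVY SNOW"

# Rule 2: labels ending in "snow" become LAKE-EFFECT SNOW.
# This also matches the HEAVY SNOW labels produced by rule 1.
x[grep("Snow$", x, ignore.case = TRUE)] <- "LAKE-EFFECT SNOW"

table(x)
```

Running table(data$EVTYPE) after each group of rules is one way to confirm that the labels end up in the intended categories.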
Now that the hard part of cleaning and pre-processing the data is done, it is time to analyse the data and get some answers for the questions we have.
Using the dplyr package, let us group the dataset by event type. After that, it’s a matter of summing HEALTHDMG and ECONDMG within each event type and ordering the result from the highest damage to the lowest.
by_event <- group_by(data, EVTYPE)
by_event <- summarise(by_event, HEALTHDMG = sum(HEALTHDMG), ECONDMG = sum(ECONDMG))
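For readers who want to see what this grouping and summing does without dplyr, roughly the same aggregation can be sketched in base R. The toy data frame and the names totals and top_health are hypothetical, and aggregate() is swapped in here for group_by() and summarise():

```r
# Toy data frame, not the storm data.
toy <- data.frame(
  EVTYPE    = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  HEALTHDMG = c(28, 1, 3, 0),
  ECONDMG   = c(2.5e6, 1e9, 2.5e4, 5e3)
)

# Sum both damage columns within each event type.
totals <- aggregate(cbind(HEALTHDMG, ECONDMG) ~ EVTYPE, data = toy, FUN = sum)

# Order from highest to lowest health damage and keep the top two.
top_health <- head(totals[order(-totals$HEALTHDMG), ], 2)
top_health
```

The result has one row per event type, which is the same shape that the ordering and plotting steps in this report work with.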
We start with the health damage and build a barplot of the top 5 events that cause it:
by_event <- arrange(by_event, desc(HEALTHDMG))
barplot(by_event$HEALTHDMG[1:5], col = rainbow(5),
        main = "Top 5 Events that cause Health Damage",
        xlab = "Events",
        ylab = "Health Damage caused")
legend("topright", legend = by_event$EVTYPE[1:5], fill = rainbow(5))
From this barplot, we can see that the top 5 events that cause health damage are, in order: Tornadoes, Excessive Heat, Thunderstorm Winds, Floods and Lightning.
Doing the same for the ECONDMG column:
by_event <- arrange(by_event, desc(ECONDMG))
barplot(by_event$ECONDMG[1:5], col = rainbow(5),
        main = "Top 5 Events that cause Economic Damage",
        xlab = "Events",
        ylab = "Economic Damage caused")
legend("topright", legend = by_event$EVTYPE[1:5], fill = rainbow(5))
From this barplot, we conclude that the top 5 events that cause economic damage are, in order: Floods, Hurricanes, Storm Surges/Tides, Tornadoes and Hail.