The original document describing the dataset can be loaded from here, the original dataset is here. Note that it’s 47Mb. If you can’t download any of these feel and you want to reproduce this analysis please contact me: “electric.blake at gmail dot com.”
In this paper we tried to analyse different types of natural events and their impact on nation’s health and nation’s economy. Many of these events result in fatalities, injuries, property damage and crop damage. It’s important to recognize those that have the highest impact and invest money and effort in preventing the consequences of these events. The dataset was created by the U.S. National Oceanic and Atmospheric Administration’s (NOAA). The most part of the data analysis is cleaning the data because most of it is dirty: names of eventa are inaccurate, formats of the damage columns are not suitable for analysis. So the main part is just data cleaning- as a result after cleaning we have a more consise dataset with just 48 types of events (as defined in the documentatin), but we lose some observations that don’t fit into any type of these categories. After cleaning we just summarize the data we have and identify the top 10 events that harm public health and economy the most.
Important note: Please note that most of the code output for this step is not shown simply because it’s too lengthy, but all the code I used for analysis is here, so you can simply run it on your machine if you want to check any of the steps.
So, let’s start. Before doing any analysis let’s look at the data we have. Let’s see the structure of this dataset. Note that it can take some time to load this dataset (it’s 47Mb even when zipped).
data = read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
So, we have 37 columns (we used only few in this analysis) and 902297 ovservations. Now let’s look at the summary and check if there is a lot of missing data - if yes, we will have to fix that.
summary(data)
So turns out that the only columns that have missing values are Latitude and Latitude_E. We don’t need them in our analysis, so it’s okay for them to be N/A.
The next step of cleaning the data is to check the variables we care about. The main variable we care about is EVTYPE. This is the type of the event and we can find the list of them on page 6 of the documentation. So we expect there to be about 48 types of events while in fact we have over 900:
dim(as.data.frame(table(data$EVTYPE)))
## [1] 985 2
head(as.data.frame(table(data$EVTYPE)), n = 30)
## Var1 Freq
## 1 HIGH SURF ADVISORY 1
## 2 COASTAL FLOOD 1
## 3 FLASH FLOOD 1
## 4 LIGHTNING 1
## 5 TSTM WIND 4
## 6 TSTM WIND (G45) 1
## 7 WATERSPOUT 1
## 8 WIND 1
## 9 ? 1
## 10 ABNORMAL WARMTH 4
## 11 ABNORMALLY DRY 2
## 12 ABNORMALLY WET 1
## 13 ACCUMULATED SNOWFALL 4
## 14 AGRICULTURAL FREEZE 6
## 15 APACHE COUNTY 1
## 16 ASTRONOMICAL HIGH TIDE 103
## 17 ASTRONOMICAL LOW TIDE 174
## 18 AVALANCE 1
## 19 AVALANCHE 386
## 20 BEACH EROSIN 1
## 21 Beach Erosion 1
## 22 BEACH EROSION 3
## 23 BEACH EROSION/COASTAL FLOOD 1
## 24 BEACH FLOOD 2
## 25 BELOW NORMAL PRECIPITATION 2
## 26 BITTER WIND CHILL 1
## 27 BITTER WIND CHILL TEMPERATURES 3
## 28 Black Ice 3
## 29 BLACK ICE 14
## 30 BLIZZARD 2719
So we’re in trouble…
And if we look carefully we’ll notice that this table is especially dirty because the names of the equivalent events are enterent in a number of different ways. For example, we have ‘WINTER STORM’ and ‘WINTER STORMS’. We need to handle it somehow. Here is the approach we’re going to use: we will create a character vector of vaild names that will match the list of evtypes from the original documentation. We will append names to these vector one by one while processing so that we can do the checksum in the end and check if we really ended up with 48 event type.
valid_evtypes = vector()
First we’ll deal with the summary. It clearly should not be here so we will get rid of these rows.
data = data[!(grepl("summary", data$EVTYPE, ignore.case = TRUE)),]
Then let’s get rid of questionmark (“?”) rows - we don’t know what type of even it is so we can’t analyse it:
data = data[!(grepl("\\?", data$EVTYPE, ignore.case = TRUE)),]
And then we will do the cleaning for all the other event types in a similar way, but this time I will replace the dirty values with the proper ones.
For astronomical low tide, there is also astronomical high tide and high tides but there is no category for that in the original documentation so we will just get rid of them in the end:
data$EVTYPE[grepl("low tide", data$EVTYPE, ignore.case = TRUE)] = "ASTRONOMICAL LOW TIDE"
valid_evtypes = c(valid_evtypes,"ASTRONOMICAL LOW TIDE")
For avalanche since there are no other mentions of avalanche:
data$EVTYPE[grepl("avalanc[h]?e", data$EVTYPE, ignore.case = TRUE)] = "AVALANCHE"
valid_evtypes = c(valid_evtypes,"AVALANCHE")
Now let’s deal with the blizzards:
table(data$EVTYPE[grep("blizzard", data$EVTYPE, ignore.case=TRUE)])
##
## BLIZZARD BLIZZARD AND EXTREME WIND CHIL
## 2719 2
## BLIZZARD AND HEAVY SNOW BLIZZARD WEATHER
## 1 1
## BLIZZARD/FREEZING RAIN BLIZZARD/HEAVY SNOW
## 1 2
## BLIZZARD/HIGH WIND BLIZZARD/WINTER STORM
## 1 1
## GROUND BLIZZARD HEAVY SNOW/BLIZZARD
## 2 3
## HIGH WIND/ BLIZZARD HIGH WIND/BLIZZARD
## 1 6
## HIGH WIND/BLIZZARD/FREEZING RA HIGH WIND/WIND CHILL/BLIZZARD
## 1 1
## Icestorm/Blizzard
## 1
The dominating event for all these is blizzard, so we will turn all these into blizzards:
blizzard = grepl("blizzard", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[blizzard] = "BLIZZARD"
valid_evtypes = c(valid_evtypes,"BLIZZARD")
rm(blizzard)
For coastal flood. There is also coastal surge and coastal storm, so we need to select only flood + coast. We can observer them by looking at the data - look at levels starting from:
head(as.data.frame(table(data$EVTYPE[grep("flood", data$EVTYPE, ignore.case=TRUE)])), n = 20)
## Var1 Freq
## 1 COASTAL FLOOD 1
## 2 FLASH FLOOD 1
## 3 BEACH EROSION/COASTAL FLOOD 1
## 4 BEACH FLOOD 2
## 5 BREAKUP FLOODING 1
## 6 COASTAL FLOODING/EROSION 1
## 7 Coastal Flood 6
## 8 COASTAL FLOOD 650
## 9 coastal flooding 2
## 10 Coastal Flooding 38
## 11 COASTAL FLOODING 143
## 12 COASTAL FLOODING/EROSION 5
## 13 COASTAL/TIDAL FLOOD 2
## 14 COASTALFLOOD 1
## 15 CSTL FLOODING/EROSION 2
## 16 Erosion/Cstl Flood 2
## 17 FLASH FLOOD 54277
## 18 FLASH FLOOD - HEAVY RAIN 2
## 19 FLASH FLOOD FROM ICE JAMS 5
## 20 FLASH FLOOD LANDSLIDES 1
If we add the two logical vectors for flood and for coast we’ll get the logical vector that will be matching all the coastal flood mentions in the EVTYPE.
flood = grepl("flood", data$EVTYPE, ignore.case = TRUE)
# coastal flood
coastal_flood = grepl("coastal", data$EVTYPE, ignore.case = TRUE) & flood
data$EVTYPE[coastal_flood] = "COASTAL FLOOD"
valid_evtypes = c(valid_evtypes,"COASTAL FLOOD")
#lakeshore fllod
lake_flood = grepl("lake", data$EVTYPE, ignore.case = TRUE) & flood
data$EVTYPE[lake_flood] = "LAKESHORE FLOOD"
valid_evtypes = c(valid_evtypes,"LAKESHORE FLOOD")
# flash flood
flash_flood = grepl("flash", data$EVTYPE, ignore.case = TRUE) & flood
data$EVTYPE[flash_flood] = "FLASH FLOOD"
valid_evtypes = c(valid_evtypes,"FLASH FLOOD")
# other floods:
just_flood = xor(flood, (coastal_flood | lake_flood | flash_flood))
data$EVTYPE[just_flood] = "FLOOD"
valid_evtypes = c(valid_evtypes,"FLOOD")
rm(flood, coastal_flood, lake_flood, flash_flood, just_flood)
For cold/wind chill we have different events in the documentation which are:
- cold/wind chill
- Extreme Cold/Wind Chill
So when looking for cold/wind chill we need to exclude: extreme, excessive and record
cold = grepl("cold", data$EVTYPE, ignore.case = TRUE)
chill = grepl("chil", data$EVTYPE, ignore.case = TRUE)
cold_chill = cold | chill #will match both chill and cold
extreme = grepl("extreme", data$EVTYPE, ignore.case = TRUE) & chill #will match those that have extrem AND chill
extreme = (grepl("extreme", data$EVTYPE, ignore.case = TRUE) & cold) | extreme
cold_chil_noextreme = xor(cold_chill, extreme)
#Same for excessive and record
record = grepl("record", data$EVTYPE, ignore.case = TRUE) & chill #will match those that have extrem AND chill
record = (grepl("record", data$EVTYPE, ignore.case = TRUE) & cold) | record
cold_chil_norecord = xor(cold_chil_noextreme, record)
severe = grepl("severe", data$EVTYPE, ignore.case = TRUE) & chill #will match those that have extrem AND chill
severe = (grepl("severe", data$EVTYPE, ignore.case = TRUE) & cold) | severe
cold_chil_nosevere = xor(cold_chil_norecord, severe)
data$EVTYPE[cold_chil_nosevere] = "COLD/WIND CHILL"
valid_evtypes = c(valid_evtypes,"COLD/WIND CHILL")
rm(cold, chill, cold_chill, extreme, cold_chil_noextreme, record, cold_chil_norecord, severe, cold_chil_nosevere)
table(data$EVTYPE[grepl("debris", levels(data$EVTYPE), ignore.case=TRUE)])
## < table of extent 0 >
debris = grepl("debris", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[debris] = "DEBRIS FLOW"
valid_evtypes = c(valid_evtypes,"DEBRIS FLOW")
We will include dense fog and just ‘fog’ here:
table(data$EVTYPE[grep("fog", data$EVTYPE, ignore.case=TRUE)])
##
## DENSE FOG FOG Freezing Fog FREEZING FOG
## 1293 538 1 45
## Ice Fog PATCHY DENSE FOG
## 2 3
fog = grepl("fog", data$EVTYPE, ignore.case = TRUE)
freeze_fog = (grepl("freeze", data$EVTYPE, ignore.case = TRUE) | grepl("ice", data$EVTYPE, ignore.case = TRUE)) & fog
fog_nofreeze = xor(fog, freeze_fog)
data$EVTYPE[fog_nofreeze] = "DENSE FOG"
valid_evtypes = c(valid_evtypes,"DENSE FOG")
data$EVTYPE[freeze_fog] = "FREEZING FOG"
valid_evtypes = c(valid_evtypes,"FREEZING FOG")
rm(fog, freeze_fog, fog_nofreeze)
table(data$EVTYPE[grep("smoke", data$EVTYPE, ignore.case=TRUE)])
##
## DENSE SMOKE SMOKE
## 10 11
smoke = grepl("smoke", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[smoke] = "DENSE SMOKE"
valid_evtypes = c(valid_evtypes,"DENSE SMOKE")
rm(smoke)
Since it’s just drought in the events list, we will selec all drought values we have:
table(data$EVTYPE[grep("drought", data$EVTYPE, ignore.case=TRUE)])
##
## DROUGHT DROUGHT/EXCESSIVE HEAT EXCESSIVE HEAT/DROUGHT
## 2488 13 1
## HEAT DROUGHT HEAT WAVE DROUGHT HEAT/DROUGHT
## 1 1 1
## SNOW DROUGHT
## 7
drought = grepl("drought", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[drought] = "DROUGHT"
valid_evtypes = c(valid_evtypes,"DROUGHT")
rm(drought)
table(data$EVTYPE[grep("dust", data$EVTYPE, ignore.case=TRUE)])
##
## BLOWING DUST DUST DEVEL Dust Devil
## 4 1 8
## DUST DEVIL DUST DEVIL WATERSPOUT DUST STORM
## 141 1 427
## DUST STORM/HIGH WINDS DUSTSTORM HIGH WINDS DUST STORM
## 1 1 1
## Saharan Dust SAHARAN DUST
## 2 2
dust = grepl("dust", data$EVTYPE, ignore.case = TRUE)
dust_devil = grepl("devil", data$EVTYPE, ignore.case = TRUE) & dust
dust = xor(dust_devil,dust)
data$EVTYPE[dust] = "DUST STORM"
data$EVTYPE[dust_devil] = "DUST DEVIL"
valid_evtypes = c(valid_evtypes,"DUST STORM")
valid_evtypes = c(valid_evtypes,"DUST DEVIL")
rm(dust, dust_devil)
table(data$EVTYPE[grep("heat", data$EVTYPE, ignore.case=TRUE)])
##
## EXCESSIVE HEAT EXTREME HEAT HEAT
## 1678 22 767
## Heat Wave HEAT WAVE HEAT WAVES
## 1 74 2
## Heatburst Record Heat RECORD HEAT
## 1 1 81
## RECORD HEAT WAVE RECORD/EXCESSIVE HEAT
## 1 3
heat = grepl("heat", data$EVTYPE, ignore.case = TRUE)
extreme = grepl("extreme", data$EVTYPE, ignore.case = TRUE) | grepl("excessive", data$EVTYPE, ignore.case = TRUE) | grepl("record", data$EVTYPE, ignore.case = TRUE)
extreme_heat = heat & extreme
data$EVTYPE[extreme_heat] = "EXCESSIVE HEAT"
just_heat = xor(heat, extreme_heat)
data$EVTYPE[just_heat] = "HEAT"
valid_evtypes = c(valid_evtypes,"EXCESSIVE HEAT")
valid_evtypes = c(valid_evtypes,"HEAT")
rm(heat, extreme, extreme_heat, just_heat)
fog = grepl("fog", data$EVTYPE, ignore.case = TRUE)
freeze_fog = (grepl("freeze", data$EVTYPE, ignore.case = TRUE)) & fog
frost_freeze = grepl("frost", data$EVTYPE, ignore.case = TRUE) | grepl("freeze", data$EVTYPE, ignore.case = TRUE)
frost_freeze = xor(frost_freeze, freeze_fog)
data$EVTYPE[frost_freeze] = "FROST/FREEZE"
valid_evtypes = c(valid_evtypes,"FROST/FREEZE")
rm(fog, freeze_fog, frost_freeze)
table(data$EVTYPE[grep("funnel", data$EVTYPE, ignore.case=TRUE)])
##
## FUNNEL Funnel Cloud
## 46 5
## FUNNEL CLOUD FUNNEL CLOUD.
## 6839 1
## FUNNEL CLOUD/HAIL FUNNEL CLOUDS
## 1 87
## FUNNELS THUNDERSTORM WINDS FUNNEL CLOU
## 1 2
## THUNDERSTORM WINDS/FUNNEL CLOU WALL CLOUD/FUNNEL CLOUD
## 1 1
## WATERSPOUT FUNNEL CLOUD
## 1
funnel = (grepl("funnel", data$EVTYPE, ignore.case = TRUE))
cloud = (grepl("clou", data$EVTYPE, ignore.case = TRUE))
funnel = funnel&cloud
data$EVTYPE[funnel] = "FUNNEL CLOUD"
valid_evtypes = c(valid_evtypes,"FUNNEL CLOUD")
rm(funnel, cloud)
marine_hail = grepl("marine hail", data$EVTYPE, ignore.case = TRUE)
hail = grepl("hail", data$EVTYPE, ignore.case = TRUE)
#get non-marine hail:
just_hail = xor(hail,marine_hail)
data$EVTYPE[marine_hail] = "MARINE HAIL"
data$EVTYPE[just_hail] = "HAIL"
valid_evtypes = c(valid_evtypes,"MARINE HAIL")
valid_evtypes = c(valid_evtypes,"HAIL")
rm(marine_hail, hail, just_hail)
head(as.data.frame(table(data$EVTYPE[grepl("heavy", data$EVTYPE, ignore.case=TRUE)])), n = 20)
## Var1 Freq
## 1 HEAVY LAKE SNOW 25
## 2 HEAVY MIX 8
## 3 HEAVY PRECIPATATION 1
## 4 Heavy Precipitation 2
## 5 HEAVY PRECIPITATION 1
## 6 Heavy rain 3
## 7 Heavy Rain 16
## 8 HEAVY RAIN 11723
## 9 Heavy Rain and Wind 4
## 10 HEAVY RAIN EFFECTS 1
## 11 Heavy Rain/High Surf 1
## 12 HEAVY RAIN/LIGHTNING 1
## 13 HEAVY RAIN/SEVERE WEATHER 2
## 14 HEAVY RAIN/SMALL STREAM URBAN 1
## 15 HEAVY RAIN/SNOW 1
## 16 HEAVY RAIN/WIND 4
## 17 HEAVY RAINFALL 3
## 18 HEAVY RAINS 26
## 19 HEAVY SEAS 2
## 20 HEAVY SHOWER 2
heavy = grepl("heavy", data$EVTYPE, ignore.case = TRUE)
snow = grepl("snow", data$EVTYPE, ignore.case = TRUE)
rain = grepl("rain", data$EVTYPE, ignore.case = TRUE)
snow_rain = snow&rain
heavy_snow = xor(heavy&snow, snow_rain)
heavy_rain = xor(heavy&rain, snow_rain)
data$EVTYPE[heavy_snow] = "HEAVY SNOW"
data$EVTYPE[heavy_rain] = "HEAVY RAIN"
valid_evtypes = c(valid_evtypes,"HEAVY SNOW")
valid_evtypes = c(valid_evtypes,"HEAVY RAIN")
rm(heavy, snow, rain, snow_rain, heavy_snow, heavy_rain)
head(as.data.frame(table(data$EVTYPE[grep("high", data$EVTYPE, ignore.case=TRUE)])), n = 20)
## Var1 Freq
## 1 HIGH SURF ADVISORY 1
## 2 ASTRONOMICAL HIGH TIDE 103
## 3 HEAVY SURF/HIGH SURF 228
## 4 HIGH 1
## 5 HIGH SWELLS 1
## 6 HIGH WINDS 1
## 7 HIGH SEAS 8
## 8 High Surf 9
## 9 HIGH SURF 725
## 10 HIGH SURF ADVISORIES 1
## 11 HIGH SURF ADVISORY 4
## 12 HIGH SWELLS 5
## 13 HIGH TEMPERATURE RECORD 3
## 14 HIGH TIDES 2
## 15 HIGH WATER 6
## 16 HIGH WAVES 3
## 17 High Wind 2
## 18 HIGH WIND 20212
## 19 HIGH WIND (G40) 2
## 20 HIGH WIND 48 1
high = grepl("high", data$EVTYPE, ignore.case = TRUE)
wind = grepl("wind", data$EVTYPE, ignore.case = TRUE)
surf = grepl("surf", data$EVTYPE, ignore.case = TRUE)
wind_surf = wind&surf
high_wind = xor(high&wind, wind_surf)
high_surf = xor(high&surf, wind_surf)
data$EVTYPE[high_wind] = "HIGH WIND"
data$EVTYPE[high_surf] = "HIGH SURF"
valid_evtypes = c(valid_evtypes,"HIGH WIND")
valid_evtypes = c(valid_evtypes,"HIGH SURF")
rm(high, wind, surf, wind_surf, high_wind, high_surf)
table(data$EVTYPE[grep("hurricane", data$EVTYPE, ignore.case=TRUE)])
##
## HURRICANE HURRICANE-GENERATED SWELLS
## 174 3
## Hurricane Edouard HURRICANE EMILY
## 2 1
## HURRICANE ERIN HURRICANE FELIX
## 7 2
## HURRICANE GORDON HURRICANE OPAL
## 1 9
## HURRICANE/TYPHOON
## 88
table(data$EVTYPE[grep("typhoon", data$EVTYPE, ignore.case=TRUE)])
##
## HURRICANE/TYPHOON TYPHOON
## 88 11
hurricane_typhoon = grepl("hurricane", data$EVTYPE, ignore.case = TRUE) | grepl("typhoon", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[hurricane_typhoon] = "HURRICANE"
valid_evtypes = c(valid_evtypes,"HURRICANE")
rm(hurricane_typhoon)
table(data$EVTYPE[grep("ice storm", data$EVTYPE, ignore.case=TRUE)])
##
## GLAZE/ICE STORM ICE STORM ICE STORM AND SNOW
## 1 2006 1
## SLEET/ICE STORM SNOW AND ICE STORM SNOW/ICE STORM
## 1 1 17
ice_storm = grepl("ice storm", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[ice_storm] = "ICE STORM"
valid_evtypes = c(valid_evtypes,"ICE STORM")
rm(ice_storm)
table(data$EVTYPE[grep("lake", data$EVTYPE, ignore.case=TRUE)])
##
## GUSTY LAKE WIND LAKE-EFFECT SNOW Lake Effect Snow LAKE EFFECT SNOW
## 1 636 2 21
## LAKESHORE FLOOD
## 24
lake_effect = grepl("lake", data$EVTYPE, ignore.case = TRUE) & grepl("effect", data$EVTYPE, ignore.case = TRUE) & grepl("snow", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[lake_effect] = "LAKE-EFFECT SNOW"
valid_evtypes = c(valid_evtypes,"LAKE-EFFECT SNOW")
rm(lake_effect)
table(data$EVTYPE[grep("lightning", data$EVTYPE, ignore.case=TRUE)])
##
## LIGHTNING LIGHTNING
## 1 15754
## LIGHTNING WAUSEON LIGHTNING AND THUNDERSTORM WIN
## 1 1
## LIGHTNING AND WINDS LIGHTNING DAMAGE
## 1 1
## LIGHTNING FIRE LIGHTNING INJURY
## 1 1
## LIGHTNING THUNDERSTORM WINDS LIGHTNING THUNDERSTORM WINDSS
## 1 1
## LIGHTNING. THUNDERSTORM WIND/LIGHTNING
## 1 1
## THUNDERSTORM WINDS LIGHTNING TSTM WIND AND LIGHTNING
## 7 1
light = grepl("lightning", data$EVTYPE, ignore.case = TRUE)
light_storm = grepl("lightning", data$EVTYPE, ignore.case = TRUE)&(grepl("storm", data$EVTYPE, ignore.case = TRUE)|grepl("tstm", data$EVTYPE, ignore.case = TRUE))
light = xor(light,light_storm)
data$EVTYPE[light] = "LIGHTNING"
valid_evtypes = c(valid_evtypes,"LIGHTNING")
rm(light, light_storm)
table(data$EVTYPE[grep("marine", data$EVTYPE, ignore.case=TRUE)])
##
## Marine Accident MARINE HAIL MARINE MISHAP
## 1 442 2
## MARINE STRONG WIND MARINE THUNDERSTORM WIND MARINE TSTM WIND
## 48 5812 6175
marine_strong_wind = grepl("marine", data$EVTYPE, ignore.case=TRUE) & grepl("wind", data$EVTYPE, ignore.case=TRUE) & grepl("strong", data$EVTYPE, ignore.case=TRUE)
marine_high_wind = grepl("marine", data$EVTYPE, ignore.case=TRUE) & grepl("wind", data$EVTYPE, ignore.case=TRUE) & grepl("high", data$EVTYPE, ignore.case=TRUE)
marine_tstm_wind = grepl("marine", data$EVTYPE, ignore.case=TRUE) & grepl("wind", data$EVTYPE, ignore.case=TRUE) & (grepl("tstm", data$EVTYPE, ignore.case=TRUE)|grepl("thunderstorm", data$EVTYPE, ignore.case=TRUE))
data$EVTYPE[marine_strong_wind] = "MARINE STRONG WIND"
data$EVTYPE[marine_high_wind] = "MARINE HIGH WIND"
data$EVTYPE[marine_tstm_wind] = "MARINE THUNDERSTORM WIND"
valid_evtypes = c(valid_evtypes,"MARINE STRONG WIND")
valid_evtypes = c(valid_evtypes,"MARINE HIGH WIND")
valid_evtypes = c(valid_evtypes,"MARINE THUNDERSTORM WIND")
table(data$EVTYPE[grep("rip", data$EVTYPE, ignore.case=TRUE)])
##
## RIP CURRENT RIP CURRENTS RIP CURRENTS HEAVY SURF
## 470 304 1
## RIP CURRENTS/HEAVY SURF
## 2
rip_current = grepl("rip current", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[rip_current] = "RIP CURRENT"
valid_evtypes = c(valid_evtypes,"RIP CURRENT")
rm(rip_current)
table(data$EVTYPE[grep("seiche", data$EVTYPE, ignore.case=TRUE)])
##
## SEICHE
## 21
valid_evtypes = c(valid_evtypes,"SEICHE")
table(data$EVTYPE[grep("sleet", data$EVTYPE, ignore.case=TRUE)])
##
## FREEZING RAIN AND SLEET FREEZING RAIN SLEET AND
## 6 1
## FREEZING RAIN SLEET AND LIGHT FREEZING RAIN/SLEET
## 1 9
## LIGHT SNOW AND SLEET SLEET
## 2 59
## SLEET & FREEZING RAIN SLEET STORM
## 1 12
## SLEET/FREEZING RAIN SLEET/SNOW
## 2 2
## Snow and sleet SNOW AND SLEET
## 1 4
## SNOW SLEET SNOW/SLEET
## 1 10
sleet = grepl("sleet", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[sleet] = "SLEET"
valid_evtypes = c(valid_evtypes,"SLEET")
rm(sleet)
table(data$EVTYPE[grep("storm surge", data$EVTYPE, ignore.case=TRUE)])
##
## STORM SURGE STORM SURGE/TIDE
## 261 148
storm_surge = grepl("storm surge", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[storm_surge] = "STORM SURGE/TIDE"
valid_evtypes = c(valid_evtypes,"STORM SURGE/TIDE")
rm(storm_surge)
table(data$EVTYPE[grep("strong wind", data$EVTYPE, ignore.case=TRUE)])
##
## ICE/STRONG WINDS MARINE STRONG WIND Strong Wind
## 1 48 3
## STRONG WIND STRONG WIND GUST Strong winds
## 3566 2 1
## Strong Winds STRONG WINDS
## 7 196
strong_wind = grepl("strong wind", data$EVTYPE, ignore.case=TRUE)
marine = grepl("strong wind", data$EVTYPE, ignore.case=TRUE) & grepl("marine", data$EVTYPE, ignore.case=TRUE)
strong_wind = xor(strong_wind, marine)
data$EVTYPE[strong_wind] = "STRONG WIND"
valid_evtypes = c(valid_evtypes,"STRONG WIND")
rm(strong_wind)
head(as.data.frame(table(data$EVTYPE[grep("thunderstorm wind", data$EVTYPE, ignore.case=TRUE)])),n=20)
## Var1 Freq
## 1 GUSTY THUNDERSTORM WIND 3
## 2 GUSTY THUNDERSTORM WINDS 5
## 3 LIGHTNING THUNDERSTORM WINDS 1
## 4 LIGHTNING THUNDERSTORM WINDSS 1
## 5 MARINE THUNDERSTORM WIND 11987
## 6 SEVERE THUNDERSTORM WINDS 5
## 7 Thunderstorm Wind 1
## 8 THUNDERSTORM WIND 82563
## 9 THUNDERSTORM WIND (G40) 1
## 10 THUNDERSTORM WIND 50 2
## 11 THUNDERSTORM WIND 52 1
## 12 THUNDERSTORM WIND 56 1
## 13 THUNDERSTORM WIND 59 1
## 14 THUNDERSTORM WIND 59 MPH 1
## 15 THUNDERSTORM WIND 59 MPH. 1
## 16 THUNDERSTORM WIND 60 MPH 4
## 17 THUNDERSTORM WIND 65 MPH 1
## 18 THUNDERSTORM WIND 65MPH 1
## 19 THUNDERSTORM WIND 69 1
## 20 THUNDERSTORM WIND 98 MPH 1
table(data$EVTYPE[grep("tstm", data$EVTYPE, ignore.case=TRUE)])
##
## TSTM WIND TSTM WIND (G45) NON-TSTM WIND
## 4 1 1
## NON TSTM WIND TSTM Tstm Wind
## 2 1 2
## TSTM WIND TSTM WIND (G45) TSTM WIND (41)
## 219940 1 1
## TSTM WIND (G35) TSTM WIND (G40) TSTM WIND (G45)
## 1 10 39
## TSTM WIND 40 TSTM WIND 45 TSTM WIND 50
## 1 1 1
## TSTM WIND 51 TSTM WIND 52 TSTM WIND 55
## 2 5 3
## TSTM WIND 65) TSTM WIND AND LIGHTNING TSTM WIND DAMAGE
## 1 1 1
## TSTM WIND G45 TSTM WIND G58 TSTM WINDS
## 1 1 6
## TSTM WND TSTMW
## 1 1
tstm_wind = grepl("thunderstorm wind", data$EVTYPE, ignore.case=TRUE)|grepl("tstm wind", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[tstm_wind] = "THUNDERSTORM WIND"
valid_evtypes = c(valid_evtypes,"THUNDERSTORM WIND")
rm(tstm_wind)
table(data$EVTYPE[grep("tornado", data$EVTYPE, ignore.case=TRUE)])
##
## TORNADO TORNADO F0 TORNADO F1
## 60652 19 4
## TORNADO F2 TORNADO F3 TORNADO/WATERSPOUT
## 3 2 1
## TORNADOES TORNADOS WATERSPOUT-TORNADO
## 2 1 2
## WATERSPOUT TORNADO WATERSPOUT/ TORNADO WATERSPOUT/TORNADO
## 1 2 8
tornado = grepl("tornado", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[tornado] = "TORNADO"
valid_evtypes = c(valid_evtypes,"TORNADO")
rm(tornado)
table(data$EVTYPE[grep("tropical depression", data$EVTYPE, ignore.case=TRUE)])
##
## TROPICAL DEPRESSION
## 60
valid_evtypes = c(valid_evtypes,"TROPICAL DEPRESSION")
table(data$EVTYPE[grep("tropical storm", data$EVTYPE, ignore.case=TRUE)])
##
## TROPICAL STORM TROPICAL STORM ALBERTO TROPICAL STORM DEAN
## 690 1 2
## TROPICAL STORM GORDON TROPICAL STORM JERRY
## 1 3
trop = grepl("tropical storm", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[trop] = "TROPICAL STORM"
valid_evtypes = c(valid_evtypes,"TROPICAL STORM")
rm(trop)
table(data$EVTYPE[grep("tsunami", data$EVTYPE, ignore.case=TRUE)])
##
## TSUNAMI
## 20
valid_evtypes = c(valid_evtypes,"TSUNAMI")
table(data$EVTYPE[grep("volcanic ash", data$EVTYPE, ignore.case=TRUE)])
##
## Volcanic Ash VOLCANIC ASH Volcanic Ash Plume
## 1 22 1
## VOLCANIC ASHFALL
## 3
volcanic = grepl("volcanic ash", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[volcanic] = "VOLCANIC ASH"
valid_evtypes = c(valid_evtypes,"VOLCANIC ASH")
rm(volcanic)
table(data$EVTYPE[grep("waterspout", data$EVTYPE, ignore.case=TRUE)])
##
## WATERSPOUT WATERSPOUT WATERSPOUT- WATERSPOUT/ WATERSPOUTS
## 1 3796 10 1 37
waterspout = grepl("waterspout", data$EVTYPE, ignore.case=TRUE)|grepl("water spout", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[waterspout] = "WATERSPOUT"
valid_evtypes = c(valid_evtypes,"WATERSPOUT")
rm(waterspout)
table(data$EVTYPE[grep("wildfire", data$EVTYPE, ignore.case=TRUE)])
##
## WILDFIRE WILDFIRES
## 2761 8
table(data$EVTYPE[grep("wild fire", data$EVTYPE, ignore.case=TRUE)])
##
## WILD FIRES
## 4
wildfire = grepl("wild", data$EVTYPE, ignore.case=TRUE)&grepl("fire", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[wildfire] = "WILDFIRE"
valid_evtypes = c(valid_evtypes,"WILDFIRE")
rm(wildfire)
table(data$EVTYPE[grep("winter storm", data$EVTYPE, ignore.case=TRUE)])
##
## WINTER STORM WINTER STORMS
## 11433 3
winter_storm = grepl("winter storm", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[winter_storm] = "WINTER STORM"
valid_evtypes = c(valid_evtypes,"WINTER STORM")
rm(winter_storm)
table(data$EVTYPE[grep("winter weather", data$EVTYPE, ignore.case=TRUE)])
##
## Winter Weather WINTER WEATHER WINTER WEATHER MIX
## 19 7026 6
## WINTER WEATHER/MIX
## 1104
winter_weather = grepl("winter weather", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[winter_weather] = "WINTER WEATHER"
valid_evtypes = c(valid_evtypes,"WINTER WEATHER")
rm(winter_weather)
And now finally the data is clean… ALMOST clean
nrow(table(data$EVTYPE))
## [1] 413
data2 = subset(data, EVTYPE %in% valid_evtypes)
table(data2$EVTYPE)
##
## ASTRONOMICAL LOW TIDE AVALANCHE BLIZZARD
## 174 388 2743
## COASTAL FLOOD COLD/WIND CHILL DEBRIS FLOW
## 852 759 1
## DENSE FOG DENSE SMOKE DROUGHT
## 1880 21 2512
## DUST DEVIL DUST STORM EXCESSIVE HEAT
## 150 439 1786
## FLASH FLOOD FLOOD FREEZING FOG
## 55676 26179 2
## FROST/FREEZE FUNNEL CLOUD HAIL
## 1505 6938 289957
## HEAT HEAVY RAIN HEAVY SNOW
## 845 11833 15817
## HIGH SURF HIGH WIND HURRICANE
## 969 21918 298
## ICE STORM LAKE-EFFECT SNOW LAKESHORE FLOOD
## 2027 659 24
## LIGHTNING MARINE HAIL MARINE STRONG WIND
## 15761 442 48
## RIP CURRENT SEICHE SLEET
## 777 21 111
## STORM SURGE/TIDE STRONG WIND THUNDERSTORM WIND
## 409 3776 335549
## TORNADO TROPICAL DEPRESSION TROPICAL STORM
## 60697 60 697
## TSUNAMI VOLCANIC ASH WATERSPOUT
## 20 27 3846
## WILDFIRE WINTER STORM WINTER WEATHER
## 4231 11436 8155
We did lose some observations in the end, but as we have seen they didn’t belong to any category the report intended to analyse.
## Cleaning damage variables
table(data$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465858 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424664 7 11330
table(data$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618336 7 19 1 9 21 281832 1 1994
So we see that the data is messed up again (if the previous 400+ lines of code weren’t enough…) and we need to do some cleaning again. In the original documentation (page) it’s said that “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”. So looks like we need to convert our property and crop damages appropriately.
Filter out all unreasonable data:
allowed = c("k", "K", "b", "B", "m", "M")
prop_filter = data$PROPDMGEXP %in% allowed | data$PROPDMGEXP==""
crop_filter = data$CROPDMGEXP %in% allowed | data$CROPDMGEXP==""
filter = prop_filter & crop_filter
data = data[filter, ]
data[data$PROPDMGEXP== "K" | data$PROPDMGEXP=="k",]$PROPDMG = data[data$PROPDMGEXP== "K" | data$PROPDMGEXP=="k",]$PROPDMG*1000
data[data$PROPDMGEXP== "M" | data$PROPDMGEXP=="m",]$PROPDMG = data[data$PROPDMGEXP== "M" | data$PROPDMGEXP=="m",]$PROPDMG*1000000
data[data$PROPDMGEXP== "B" | data$PROPDMGEXP=="b",]$PROPDMG = data[data$PROPDMGEXP== "B" | data$PROPDMGEXP=="b",]$PROPDMG*1000000000
And then we will do the same for the crop damage:
data[data$CROPDMGEXP== "K" | data$CROPDMGEXP=="k",]$CROPDMG = data[data$CROPDMGEXP== "K" | data$CROPDMGEXP=="k",]$CROPDMG*1000
data[data$CROPDMGEXP== "M" | data$CROPDMGEXP=="m",]$CROPDMG = data[data$CROPDMGEXP== "M" | data$CROPDMGEXP=="m",]$CROPDMG*1000000
data[data$CROPDMGEXP== "B" | data$CROPDMGEXP=="b",]$CROPDMG = data[data$CROPDMGEXP== "B" | data$CROPDMGEXP=="b",]$CROPDMG*1000000000
So now that we have ALMOST clean data. For this part we will add together fatalities and injuries.
health_data = data[c("EVTYPE", "FATALITIES", "INJURIES")]
health_data$TOTAL = health_data$INJURIES + health_data$FATALITIES
#fatalities = aggregate(FATALITIES ~ EVTYPE,data = health_data, sum)
#injuries = aggregate(INJURIES ~ EVTYPE,data = health_data, sum)
total = aggregate(TOTAL ~ EVTYPE,data = health_data, sum)
top10_total = tail(total[order(total$TOTAL),],n=10)
#top10_fatalities = fatalities[order(fatalities$FATALITIES, decreasing=TRUE),][1:10,]
#top10_injuries = injuries[order(injuries$INJURIES, decreasing=TRUE),][1:10,]
top10_total[order(top10_total$TOTAL, decreasing=TRUE),]
## EVTYPE TOTAL
## 342 TORNADO 97015
## 327 THUNDERSTORM WIND 10119
## 69 EXCESSIVE HEAT 8748
## 89 FLOOD 7279
## 194 LIGHTNING 6048
## 124 HEAT 3593
## 87 FLASH FLOOD 2837
## 168 ICE STORM 2079
## 149 HIGH WIND 1788
## 396 WILDFIRE 1696
So the data looks reasonable, let’s actually plot it:
par(mfrow = c(1,1), las = 2, mar=c(5,10,4,2))
barplot(top10_total$TOTAL, width=20,
main = expression("Total health damage (injuries + fatalities)"),
col = "thistle3",
horiz=TRUE,
names.arg = top10_total$EVTYPE, cex.names = 0.8, cex.axis=0.8)
Since we have our data ready we will do similar analysis as in the previous part
damage_data = data[c("EVTYPE", "PROPDMG", "CROPDMG")]
damage_data$TOTAL = damage_data$PROPDMG + damage_data$CROPDMG
total = aggregate(TOTAL ~ EVTYPE,data = damage_data, sum)
top10_total = tail(total[order(total$TOTAL),],n=10)
top10_total[order(top10_total$TOTAL, decreasing=TRUE),]
## EVTYPE TOTAL
## 89 FLOOD 161029482607
## 155 HURRICANE 90762527810
## 342 TORNADO 57357880993
## 314 STORM SURGE/TIDE 47965579000
## 122 HAIL 20709194654
## 87 FLASH FLOOD 18438634977
## 45 DROUGHT 15018927780
## 327 THUNDERSTORM WIND 10909646291
## 168 ICE STORM 8968137810
## 396 WILDFIRE 8894345130
So the data looks reasonable, let’s actually plot it:
par(mfrow = c(1,1), las = 2, mar=c(5,10,4,2))
barplot(top10_total$TOTAL, width=20,
main = expression("Total economical damage"),
col = "thistle3",
horiz=TRUE,
names.arg = top10_total$EVTYPE, cex.names = 0.8, cex.axis=0.8)