Preliminaries

The original document describing the dataset can be loaded from here, the original dataset is here. Note that it’s 47Mb. If you can’t download any of these feel and you want to reproduce this analysis please contact me: “electric.blake at gmail dot com.”

Synopsis

In this paper we tried to analyse different types of natural events and their impact on nation’s health and nation’s economy. Many of these events result in fatalities, injuries, property damage and crop damage. It’s important to recognize those that have the highest impact and invest money and effort in preventing the consequences of these events. The dataset was created by the U.S. National Oceanic and Atmospheric Administration’s (NOAA). The most part of the data analysis is cleaning the data because most of it is dirty: names of eventa are inaccurate, formats of the damage columns are not suitable for analysis. So the main part is just data cleaning- as a result after cleaning we have a more consise dataset with just 48 types of events (as defined in the documentatin), but we lose some observations that don’t fit into any type of these categories. After cleaning we just summarize the data we have and identify the top 10 events that harm public health and economy the most.

Data processing

Cleaning the EVTYPE variable

Important note: Please note that most of the code output for this step is not shown simply because it’s too lengthy, but all the code I used for analysis is here, so you can simply run it on your machine if you want to check any of the steps.

So, let’s start. Before doing any analysis let’s look at the data we have. Let’s see the structure of this dataset. Note that it can take some time to load this dataset (it’s 47Mb even when zipped).

data = read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

So, we have 37 columns (we used only few in this analysis) and 902297 ovservations. Now let’s look at the summary and check if there is a lot of missing data - if yes, we will have to fix that.

summary(data)

So turns out that the only columns that have missing values are Latitude and Latitude_E. We don’t need them in our analysis, so it’s okay for them to be N/A.
The next step of cleaning the data is to check the variables we care about. The main variable we care about is EVTYPE. This is the type of the event and we can find the list of them on page 6 of the documentation. So we expect there to be about 48 types of events while in fact we have over 900:

dim(as.data.frame(table(data$EVTYPE)))
## [1] 985   2
head(as.data.frame(table(data$EVTYPE)), n = 30)
##                              Var1 Freq
## 1              HIGH SURF ADVISORY    1
## 2                   COASTAL FLOOD    1
## 3                     FLASH FLOOD    1
## 4                       LIGHTNING    1
## 5                       TSTM WIND    4
## 6                 TSTM WIND (G45)    1
## 7                      WATERSPOUT    1
## 8                            WIND    1
## 9                               ?    1
## 10                ABNORMAL WARMTH    4
## 11                 ABNORMALLY DRY    2
## 12                 ABNORMALLY WET    1
## 13           ACCUMULATED SNOWFALL    4
## 14            AGRICULTURAL FREEZE    6
## 15                  APACHE COUNTY    1
## 16         ASTRONOMICAL HIGH TIDE  103
## 17          ASTRONOMICAL LOW TIDE  174
## 18                       AVALANCE    1
## 19                      AVALANCHE  386
## 20                   BEACH EROSIN    1
## 21                  Beach Erosion    1
## 22                  BEACH EROSION    3
## 23    BEACH EROSION/COASTAL FLOOD    1
## 24                    BEACH FLOOD    2
## 25     BELOW NORMAL PRECIPITATION    2
## 26              BITTER WIND CHILL    1
## 27 BITTER WIND CHILL TEMPERATURES    3
## 28                      Black Ice    3
## 29                      BLACK ICE   14
## 30                       BLIZZARD 2719

So we’re in trouble…
And if we look carefully we’ll notice that this table is especially dirty because the names of the equivalent events are enterent in a number of different ways. For example, we have ‘WINTER STORM’ and ‘WINTER STORMS’. We need to handle it somehow. Here is the approach we’re going to use: we will create a character vector of vaild names that will match the list of evtypes from the original documentation. We will append names to these vector one by one while processing so that we can do the checksum in the end and check if we really ended up with 48 event type.

valid_evtypes = vector()

First we’ll deal with the summary. It clearly should not be here so we will get rid of these rows.

data = data[!(grepl("summary", data$EVTYPE, ignore.case = TRUE)),]

Then let’s get rid of questionmark (“?”) rows - we don’t know what type of even it is so we can’t analyse it:

data = data[!(grepl("\\?", data$EVTYPE, ignore.case = TRUE)),]

And then we will do the cleaning for all the other event types in a similar way, but this time I will replace the dirty values with the proper ones.

ASTRONOMICAL LOW TIDE

For astronomical low tide, there is also astronomical high tide and high tides but there is no category for that in the original documentation so we will just get rid of them in the end:

data$EVTYPE[grepl("low tide", data$EVTYPE, ignore.case = TRUE)] = "ASTRONOMICAL LOW TIDE"        
valid_evtypes = c(valid_evtypes,"ASTRONOMICAL LOW TIDE")

AVALANCHE

For avalanche since there are no other mentions of avalanche:

data$EVTYPE[grepl("avalanc[h]?e", data$EVTYPE, ignore.case = TRUE)] = "AVALANCHE"        
valid_evtypes = c(valid_evtypes,"AVALANCHE")

BLIZZARD

Now let’s deal with the blizzards:

table(data$EVTYPE[grep("blizzard", data$EVTYPE, ignore.case=TRUE)])
## 
##                       BLIZZARD BLIZZARD AND EXTREME WIND CHIL 
##                           2719                              2 
##        BLIZZARD AND HEAVY SNOW               BLIZZARD WEATHER 
##                              1                              1 
##         BLIZZARD/FREEZING RAIN            BLIZZARD/HEAVY SNOW 
##                              1                              2 
##             BLIZZARD/HIGH WIND          BLIZZARD/WINTER STORM 
##                              1                              1 
##                GROUND BLIZZARD            HEAVY SNOW/BLIZZARD 
##                              2                              3 
##            HIGH WIND/ BLIZZARD             HIGH WIND/BLIZZARD 
##                              1                              6 
## HIGH WIND/BLIZZARD/FREEZING RA  HIGH WIND/WIND CHILL/BLIZZARD 
##                              1                              1 
##              Icestorm/Blizzard 
##                              1

The dominating event for all these is blizzard, so we will turn all these into blizzards:

blizzard = grepl("blizzard", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[blizzard] = "BLIZZARD"        
valid_evtypes = c(valid_evtypes,"BLIZZARD")
rm(blizzard)

COASTAL FLOOD, LAKESHORE FLOOD, FLOOD, FLASH FLOOD

For coastal flood. There is also coastal surge and coastal storm, so we need to select only flood + coast. We can observer them by looking at the data - look at levels starting from:

head(as.data.frame(table(data$EVTYPE[grep("flood", data$EVTYPE, ignore.case=TRUE)])), n = 20)
##                           Var1  Freq
## 1                COASTAL FLOOD     1
## 2                  FLASH FLOOD     1
## 3  BEACH EROSION/COASTAL FLOOD     1
## 4                  BEACH FLOOD     2
## 5             BREAKUP FLOODING     1
## 6    COASTAL  FLOODING/EROSION     1
## 7                Coastal Flood     6
## 8                COASTAL FLOOD   650
## 9             coastal flooding     2
## 10            Coastal Flooding    38
## 11            COASTAL FLOODING   143
## 12    COASTAL FLOODING/EROSION     5
## 13         COASTAL/TIDAL FLOOD     2
## 14                COASTALFLOOD     1
## 15       CSTL FLOODING/EROSION     2
## 16          Erosion/Cstl Flood     2
## 17                 FLASH FLOOD 54277
## 18    FLASH FLOOD - HEAVY RAIN     2
## 19   FLASH FLOOD FROM ICE JAMS     5
## 20      FLASH FLOOD LANDSLIDES     1

If we add the two logical vectors for flood and for coast we’ll get the logical vector that will be matching all the coastal flood mentions in the EVTYPE.

flood = grepl("flood", data$EVTYPE, ignore.case = TRUE)
# coastal flood 
coastal_flood = grepl("coastal", data$EVTYPE, ignore.case = TRUE) & flood
data$EVTYPE[coastal_flood] = "COASTAL FLOOD"        
valid_evtypes = c(valid_evtypes,"COASTAL FLOOD")
#lakeshore fllod 
lake_flood = grepl("lake", data$EVTYPE, ignore.case = TRUE) & flood
data$EVTYPE[lake_flood] = "LAKESHORE FLOOD"        
valid_evtypes = c(valid_evtypes,"LAKESHORE FLOOD")
# flash flood
flash_flood = grepl("flash", data$EVTYPE, ignore.case = TRUE) & flood
data$EVTYPE[flash_flood] = "FLASH FLOOD"        
valid_evtypes = c(valid_evtypes,"FLASH FLOOD")
# other floods:
just_flood = xor(flood, (coastal_flood | lake_flood | flash_flood))
data$EVTYPE[just_flood] = "FLOOD"        
valid_evtypes = c(valid_evtypes,"FLOOD")
rm(flood, coastal_flood, lake_flood, flash_flood, just_flood)

COLD/WIND CHIL

For cold/wind chill we have different events in the documentation which are:
- cold/wind chill
- Extreme Cold/Wind Chill
So when looking for cold/wind chill we need to exclude: extreme, excessive and record

cold = grepl("cold", data$EVTYPE, ignore.case = TRUE)
chill = grepl("chil", data$EVTYPE, ignore.case = TRUE)
cold_chill = cold | chill #will match both chill and cold

extreme = grepl("extreme", data$EVTYPE, ignore.case = TRUE) & chill #will match those that have extrem AND chill
extreme = (grepl("extreme", data$EVTYPE, ignore.case = TRUE) & cold) | extreme
cold_chil_noextreme = xor(cold_chill, extreme)

#Same for excessive and record
record = grepl("record", data$EVTYPE, ignore.case = TRUE) & chill #will match those that have extrem AND chill
record = (grepl("record", data$EVTYPE, ignore.case = TRUE) & cold) | record
cold_chil_norecord = xor(cold_chil_noextreme, record)

severe = grepl("severe", data$EVTYPE, ignore.case = TRUE) & chill #will match those that have extrem AND chill
severe = (grepl("severe", data$EVTYPE, ignore.case = TRUE) & cold) | severe
cold_chil_nosevere = xor(cold_chil_norecord, severe)
data$EVTYPE[cold_chil_nosevere] = "COLD/WIND CHILL"        
valid_evtypes = c(valid_evtypes,"COLD/WIND CHILL")
rm(cold, chill, cold_chill, extreme, cold_chil_noextreme, record, cold_chil_norecord, severe, cold_chil_nosevere)

DEBRIS FLOW

table(data$EVTYPE[grepl("debris", levels(data$EVTYPE), ignore.case=TRUE)])
## < table of extent 0 >
debris = grepl("debris", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[debris] = "DEBRIS FLOW"        
valid_evtypes = c(valid_evtypes,"DEBRIS FLOW")

DENSE FOG, FREEZING FOG

We will include dense fog and just ‘fog’ here:

table(data$EVTYPE[grep("fog", data$EVTYPE, ignore.case=TRUE)])
## 
##        DENSE FOG              FOG     Freezing Fog     FREEZING FOG 
##             1293              538                1               45 
##          Ice Fog PATCHY DENSE FOG 
##                2                3
fog = grepl("fog", data$EVTYPE, ignore.case = TRUE)
freeze_fog = (grepl("freeze", data$EVTYPE, ignore.case = TRUE) | grepl("ice", data$EVTYPE, ignore.case = TRUE)) & fog
fog_nofreeze = xor(fog, freeze_fog)
data$EVTYPE[fog_nofreeze] = "DENSE FOG"        
valid_evtypes = c(valid_evtypes,"DENSE FOG")
data$EVTYPE[freeze_fog] = "FREEZING FOG"        
valid_evtypes = c(valid_evtypes,"FREEZING FOG")
rm(fog, freeze_fog, fog_nofreeze)

DENSE SMOKE

table(data$EVTYPE[grep("smoke", data$EVTYPE, ignore.case=TRUE)])
## 
## DENSE SMOKE       SMOKE 
##          10          11
smoke = grepl("smoke", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[smoke] = "DENSE SMOKE"        
valid_evtypes = c(valid_evtypes,"DENSE SMOKE")
rm(smoke)

DROUGT

Since it’s just drought in the events list, we will selec all drought values we have:

table(data$EVTYPE[grep("drought", data$EVTYPE, ignore.case=TRUE)])
## 
##                DROUGHT DROUGHT/EXCESSIVE HEAT EXCESSIVE HEAT/DROUGHT 
##                   2488                     13                      1 
##           HEAT DROUGHT      HEAT WAVE DROUGHT           HEAT/DROUGHT 
##                      1                      1                      1 
##           SNOW DROUGHT 
##                      7
drought = grepl("drought", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[drought] = "DROUGHT"        
valid_evtypes = c(valid_evtypes,"DROUGHT")
rm(drought)

DUST STORM, DUST DEVIL

table(data$EVTYPE[grep("dust", data$EVTYPE, ignore.case=TRUE)])
## 
##          BLOWING DUST            DUST DEVEL            Dust Devil 
##                     4                     1                     8 
##            DUST DEVIL DUST DEVIL WATERSPOUT            DUST STORM 
##                   141                     1                   427 
## DUST STORM/HIGH WINDS             DUSTSTORM HIGH WINDS DUST STORM 
##                     1                     1                     1 
##          Saharan Dust          SAHARAN DUST 
##                     2                     2
dust = grepl("dust", data$EVTYPE, ignore.case = TRUE)
dust_devil = grepl("devil", data$EVTYPE, ignore.case = TRUE) & dust
dust = xor(dust_devil,dust)
data$EVTYPE[dust] = "DUST STORM"
data$EVTYPE[dust_devil] = "DUST DEVIL"
valid_evtypes = c(valid_evtypes,"DUST STORM")
valid_evtypes = c(valid_evtypes,"DUST DEVIL")
rm(dust, dust_devil)

EXCESSIVE HEAT, HEAT

table(data$EVTYPE[grep("heat", data$EVTYPE, ignore.case=TRUE)])
## 
##        EXCESSIVE HEAT          EXTREME HEAT                  HEAT 
##                  1678                    22                   767 
##             Heat Wave             HEAT WAVE            HEAT WAVES 
##                     1                    74                     2 
##             Heatburst           Record Heat           RECORD HEAT 
##                     1                     1                    81 
##      RECORD HEAT WAVE RECORD/EXCESSIVE HEAT 
##                     1                     3
heat = grepl("heat", data$EVTYPE, ignore.case = TRUE)
extreme = grepl("extreme", data$EVTYPE, ignore.case = TRUE) | grepl("excessive", data$EVTYPE, ignore.case = TRUE) | grepl("record", data$EVTYPE, ignore.case = TRUE)
extreme_heat = heat & extreme
data$EVTYPE[extreme_heat] = "EXCESSIVE HEAT"
just_heat = xor(heat, extreme_heat)
data$EVTYPE[just_heat] = "HEAT"
valid_evtypes = c(valid_evtypes,"EXCESSIVE HEAT")
valid_evtypes = c(valid_evtypes,"HEAT")
rm(heat, extreme, extreme_heat, just_heat)

FROST/FREEZE

fog = grepl("fog", data$EVTYPE, ignore.case = TRUE)
freeze_fog = (grepl("freeze", data$EVTYPE, ignore.case = TRUE)) & fog
frost_freeze = grepl("frost", data$EVTYPE, ignore.case = TRUE) | grepl("freeze", data$EVTYPE, ignore.case = TRUE)
frost_freeze = xor(frost_freeze, freeze_fog)
data$EVTYPE[frost_freeze] = "FROST/FREEZE"
valid_evtypes = c(valid_evtypes,"FROST/FREEZE")
rm(fog, freeze_fog, frost_freeze)

FUNNEL CLOUD

table(data$EVTYPE[grep("funnel", data$EVTYPE, ignore.case=TRUE)])
## 
##                         FUNNEL                   Funnel Cloud 
##                             46                              5 
##                   FUNNEL CLOUD                  FUNNEL CLOUD. 
##                           6839                              1 
##              FUNNEL CLOUD/HAIL                  FUNNEL CLOUDS 
##                              1                             87 
##                        FUNNELS THUNDERSTORM WINDS FUNNEL CLOU 
##                              1                              2 
## THUNDERSTORM WINDS/FUNNEL CLOU        WALL CLOUD/FUNNEL CLOUD 
##                              1                              1 
##        WATERSPOUT FUNNEL CLOUD 
##                              1
funnel = (grepl("funnel", data$EVTYPE, ignore.case = TRUE))
cloud = (grepl("clou", data$EVTYPE, ignore.case = TRUE))
funnel = funnel&cloud
data$EVTYPE[funnel] = "FUNNEL CLOUD"
valid_evtypes = c(valid_evtypes,"FUNNEL CLOUD")
rm(funnel, cloud)

MARINE HAIL, HAIL

marine_hail = grepl("marine hail", data$EVTYPE, ignore.case = TRUE)
hail = grepl("hail", data$EVTYPE, ignore.case = TRUE)
#get non-marine hail: 
just_hail = xor(hail,marine_hail)
data$EVTYPE[marine_hail] = "MARINE HAIL"
data$EVTYPE[just_hail] = "HAIL"        
valid_evtypes = c(valid_evtypes,"MARINE HAIL")
valid_evtypes = c(valid_evtypes,"HAIL")
rm(marine_hail, hail, just_hail)

HEAVY RAIN, HEAVY SNOW

head(as.data.frame(table(data$EVTYPE[grepl("heavy", data$EVTYPE, ignore.case=TRUE)])), n = 20)
##                             Var1  Freq
## 1                HEAVY LAKE SNOW    25
## 2                      HEAVY MIX     8
## 3            HEAVY PRECIPATATION     1
## 4            Heavy Precipitation     2
## 5            HEAVY PRECIPITATION     1
## 6                     Heavy rain     3
## 7                     Heavy Rain    16
## 8                     HEAVY RAIN 11723
## 9            Heavy Rain and Wind     4
## 10            HEAVY RAIN EFFECTS     1
## 11          Heavy Rain/High Surf     1
## 12          HEAVY RAIN/LIGHTNING     1
## 13     HEAVY RAIN/SEVERE WEATHER     2
## 14 HEAVY RAIN/SMALL STREAM URBAN     1
## 15               HEAVY RAIN/SNOW     1
## 16               HEAVY RAIN/WIND     4
## 17                HEAVY RAINFALL     3
## 18                   HEAVY RAINS    26
## 19                    HEAVY SEAS     2
## 20                  HEAVY SHOWER     2
heavy = grepl("heavy", data$EVTYPE, ignore.case = TRUE)
snow = grepl("snow", data$EVTYPE, ignore.case = TRUE)
rain = grepl("rain", data$EVTYPE, ignore.case = TRUE)
snow_rain = snow&rain
heavy_snow = xor(heavy&snow, snow_rain)
heavy_rain = xor(heavy&rain, snow_rain)
data$EVTYPE[heavy_snow] = "HEAVY SNOW"
data$EVTYPE[heavy_rain] = "HEAVY RAIN"        
valid_evtypes = c(valid_evtypes,"HEAVY SNOW")
valid_evtypes = c(valid_evtypes,"HEAVY RAIN")
rm(heavy, snow, rain, snow_rain, heavy_snow, heavy_rain)

HIGH SURF, HIGH WIND

head(as.data.frame(table(data$EVTYPE[grep("high", data$EVTYPE, ignore.case=TRUE)])), n = 20)
##                       Var1  Freq
## 1       HIGH SURF ADVISORY     1
## 2   ASTRONOMICAL HIGH TIDE   103
## 3     HEAVY SURF/HIGH SURF   228
## 4                     HIGH     1
## 5             HIGH  SWELLS     1
## 6              HIGH  WINDS     1
## 7                HIGH SEAS     8
## 8                High Surf     9
## 9                HIGH SURF   725
## 10    HIGH SURF ADVISORIES     1
## 11      HIGH SURF ADVISORY     4
## 12             HIGH SWELLS     5
## 13 HIGH TEMPERATURE RECORD     3
## 14              HIGH TIDES     2
## 15              HIGH WATER     6
## 16              HIGH WAVES     3
## 17               High Wind     2
## 18               HIGH WIND 20212
## 19         HIGH WIND (G40)     2
## 20            HIGH WIND 48     1
high = grepl("high", data$EVTYPE, ignore.case = TRUE)
wind = grepl("wind", data$EVTYPE, ignore.case = TRUE)
surf = grepl("surf", data$EVTYPE, ignore.case = TRUE)
wind_surf = wind&surf
high_wind = xor(high&wind, wind_surf)
high_surf = xor(high&surf, wind_surf)
data$EVTYPE[high_wind] = "HIGH WIND"
data$EVTYPE[high_surf] = "HIGH SURF"        
valid_evtypes = c(valid_evtypes,"HIGH WIND")
valid_evtypes = c(valid_evtypes,"HIGH SURF")
rm(high, wind, surf, wind_surf, high_wind, high_surf)

HURRICANE (TYPHOON)

table(data$EVTYPE[grep("hurricane", data$EVTYPE, ignore.case=TRUE)])
## 
##                  HURRICANE HURRICANE-GENERATED SWELLS 
##                        174                          3 
##          Hurricane Edouard            HURRICANE EMILY 
##                          2                          1 
##             HURRICANE ERIN            HURRICANE FELIX 
##                          7                          2 
##           HURRICANE GORDON             HURRICANE OPAL 
##                          1                          9 
##          HURRICANE/TYPHOON 
##                         88
table(data$EVTYPE[grep("typhoon", data$EVTYPE, ignore.case=TRUE)])
## 
## HURRICANE/TYPHOON           TYPHOON 
##                88                11
hurricane_typhoon = grepl("hurricane", data$EVTYPE, ignore.case = TRUE) | grepl("typhoon", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[hurricane_typhoon] = "HURRICANE"
valid_evtypes = c(valid_evtypes,"HURRICANE")
rm(hurricane_typhoon)

ICE STORM

table(data$EVTYPE[grep("ice storm", data$EVTYPE, ignore.case=TRUE)])
## 
##    GLAZE/ICE STORM          ICE STORM ICE STORM AND SNOW 
##                  1               2006                  1 
##    SLEET/ICE STORM SNOW AND ICE STORM     SNOW/ICE STORM 
##                  1                  1                 17
ice_storm = grepl("ice storm", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[ice_storm] = "ICE STORM"
valid_evtypes = c(valid_evtypes,"ICE STORM")
rm(ice_storm)

LAKE-EFFECT SNOW

table(data$EVTYPE[grep("lake", data$EVTYPE, ignore.case=TRUE)])
## 
##  GUSTY LAKE WIND LAKE-EFFECT SNOW Lake Effect Snow LAKE EFFECT SNOW 
##                1              636                2               21 
##  LAKESHORE FLOOD 
##               24
lake_effect = grepl("lake", data$EVTYPE, ignore.case = TRUE) & grepl("effect", data$EVTYPE, ignore.case = TRUE) & grepl("snow", data$EVTYPE, ignore.case = TRUE)
data$EVTYPE[lake_effect] = "LAKE-EFFECT SNOW"
valid_evtypes = c(valid_evtypes,"LAKE-EFFECT SNOW")
rm(lake_effect)

LIGHTNING

table(data$EVTYPE[grep("lightning", data$EVTYPE, ignore.case=TRUE)])
## 
##                      LIGHTNING                      LIGHTNING 
##                              1                          15754 
##             LIGHTNING  WAUSEON LIGHTNING AND THUNDERSTORM WIN 
##                              1                              1 
##            LIGHTNING AND WINDS               LIGHTNING DAMAGE 
##                              1                              1 
##                 LIGHTNING FIRE               LIGHTNING INJURY 
##                              1                              1 
##   LIGHTNING THUNDERSTORM WINDS  LIGHTNING THUNDERSTORM WINDSS 
##                              1                              1 
##                     LIGHTNING.    THUNDERSTORM WIND/LIGHTNING 
##                              1                              1 
##   THUNDERSTORM WINDS LIGHTNING        TSTM WIND AND LIGHTNING 
##                              7                              1
light = grepl("lightning", data$EVTYPE, ignore.case = TRUE)
light_storm = grepl("lightning", data$EVTYPE, ignore.case = TRUE)&(grepl("storm", data$EVTYPE, ignore.case = TRUE)|grepl("tstm", data$EVTYPE, ignore.case = TRUE))
light = xor(light,light_storm)
data$EVTYPE[light] = "LIGHTNING"
valid_evtypes = c(valid_evtypes,"LIGHTNING")
rm(light, light_storm)

MARINE STRONG WIND, MARINE HIGH WIND, MARINE THUNDERSTORM WIND

table(data$EVTYPE[grep("marine", data$EVTYPE, ignore.case=TRUE)])
## 
##          Marine Accident              MARINE HAIL            MARINE MISHAP 
##                        1                      442                        2 
##       MARINE STRONG WIND MARINE THUNDERSTORM WIND         MARINE TSTM WIND 
##                       48                     5812                     6175
marine_strong_wind = grepl("marine", data$EVTYPE, ignore.case=TRUE) & grepl("wind", data$EVTYPE, ignore.case=TRUE) & grepl("strong", data$EVTYPE, ignore.case=TRUE)
marine_high_wind = grepl("marine", data$EVTYPE, ignore.case=TRUE) & grepl("wind", data$EVTYPE, ignore.case=TRUE) & grepl("high", data$EVTYPE, ignore.case=TRUE)
marine_tstm_wind = grepl("marine", data$EVTYPE, ignore.case=TRUE) & grepl("wind", data$EVTYPE, ignore.case=TRUE) & (grepl("tstm", data$EVTYPE, ignore.case=TRUE)|grepl("thunderstorm", data$EVTYPE, ignore.case=TRUE))
data$EVTYPE[marine_strong_wind] = "MARINE STRONG WIND"
data$EVTYPE[marine_high_wind] = "MARINE HIGH WIND"
data$EVTYPE[marine_tstm_wind] = "MARINE THUNDERSTORM WIND"
valid_evtypes = c(valid_evtypes,"MARINE STRONG WIND")
valid_evtypes = c(valid_evtypes,"MARINE HIGH WIND")
valid_evtypes = c(valid_evtypes,"MARINE THUNDERSTORM WIND")

RIP CURRENT

table(data$EVTYPE[grep("rip", data$EVTYPE, ignore.case=TRUE)])
## 
##             RIP CURRENT            RIP CURRENTS RIP CURRENTS HEAVY SURF 
##                     470                     304                       1 
## RIP CURRENTS/HEAVY SURF 
##                       2
rip_current = grepl("rip current", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[rip_current] = "RIP CURRENT"
valid_evtypes = c(valid_evtypes,"RIP CURRENT")
rm(rip_current)

SEICHE

table(data$EVTYPE[grep("seiche", data$EVTYPE, ignore.case=TRUE)])
## 
## SEICHE 
##     21
valid_evtypes = c(valid_evtypes,"SEICHE")

SLEET

table(data$EVTYPE[grep("sleet", data$EVTYPE, ignore.case=TRUE)])
## 
##       FREEZING RAIN AND SLEET       FREEZING RAIN SLEET AND 
##                             6                             1 
## FREEZING RAIN SLEET AND LIGHT           FREEZING RAIN/SLEET 
##                             1                             9 
##          LIGHT SNOW AND SLEET                         SLEET 
##                             2                            59 
##         SLEET & FREEZING RAIN                   SLEET STORM 
##                             1                            12 
##           SLEET/FREEZING RAIN                    SLEET/SNOW 
##                             2                             2 
##                Snow and sleet                SNOW AND SLEET 
##                             1                             4 
##                    SNOW SLEET                    SNOW/SLEET 
##                             1                            10
sleet = grepl("sleet", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[sleet] = "SLEET"
valid_evtypes = c(valid_evtypes,"SLEET")
rm(sleet)

STROM SURGE/TIDE

table(data$EVTYPE[grep("storm surge", data$EVTYPE, ignore.case=TRUE)])
## 
##      STORM SURGE STORM SURGE/TIDE 
##              261              148
storm_surge = grepl("storm surge", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[storm_surge] = "STORM SURGE/TIDE"
valid_evtypes = c(valid_evtypes,"STORM SURGE/TIDE")
rm(storm_surge)

STRONG WIND

table(data$EVTYPE[grep("strong wind", data$EVTYPE, ignore.case=TRUE)])
## 
##   ICE/STRONG WINDS MARINE STRONG WIND        Strong Wind 
##                  1                 48                  3 
##        STRONG WIND   STRONG WIND GUST       Strong winds 
##               3566                  2                  1 
##       Strong Winds       STRONG WINDS 
##                  7                196
strong_wind = grepl("strong wind", data$EVTYPE, ignore.case=TRUE)
marine = grepl("strong wind", data$EVTYPE, ignore.case=TRUE) & grepl("marine", data$EVTYPE, ignore.case=TRUE)
strong_wind = xor(strong_wind, marine)
data$EVTYPE[strong_wind] = "STRONG WIND"
valid_evtypes = c(valid_evtypes,"STRONG WIND")
rm(strong_wind)

ThUNDERSTORM WIND

head(as.data.frame(table(data$EVTYPE[grep("thunderstorm wind", data$EVTYPE, ignore.case=TRUE)])),n=20)
##                             Var1  Freq
## 1        GUSTY THUNDERSTORM WIND     3
## 2       GUSTY THUNDERSTORM WINDS     5
## 3   LIGHTNING THUNDERSTORM WINDS     1
## 4  LIGHTNING THUNDERSTORM WINDSS     1
## 5       MARINE THUNDERSTORM WIND 11987
## 6      SEVERE THUNDERSTORM WINDS     5
## 7              Thunderstorm Wind     1
## 8              THUNDERSTORM WIND 82563
## 9        THUNDERSTORM WIND (G40)     1
## 10          THUNDERSTORM WIND 50     2
## 11          THUNDERSTORM WIND 52     1
## 12          THUNDERSTORM WIND 56     1
## 13          THUNDERSTORM WIND 59     1
## 14      THUNDERSTORM WIND 59 MPH     1
## 15     THUNDERSTORM WIND 59 MPH.     1
## 16      THUNDERSTORM WIND 60 MPH     4
## 17      THUNDERSTORM WIND 65 MPH     1
## 18       THUNDERSTORM WIND 65MPH     1
## 19          THUNDERSTORM WIND 69     1
## 20      THUNDERSTORM WIND 98 MPH     1
table(data$EVTYPE[grep("tstm", data$EVTYPE, ignore.case=TRUE)])
## 
##               TSTM WIND         TSTM WIND (G45)           NON-TSTM WIND 
##                       4                       1                       1 
##           NON TSTM WIND                    TSTM               Tstm Wind 
##                       2                       1                       2 
##               TSTM WIND        TSTM WIND  (G45)          TSTM WIND (41) 
##                  219940                       1                       1 
##         TSTM WIND (G35)         TSTM WIND (G40)         TSTM WIND (G45) 
##                       1                      10                      39 
##            TSTM WIND 40            TSTM WIND 45            TSTM WIND 50 
##                       1                       1                       1 
##            TSTM WIND 51            TSTM WIND 52            TSTM WIND 55 
##                       2                       5                       3 
##           TSTM WIND 65) TSTM WIND AND LIGHTNING        TSTM WIND DAMAGE 
##                       1                       1                       1 
##           TSTM WIND G45           TSTM WIND G58              TSTM WINDS 
##                       1                       1                       6 
##                TSTM WND                   TSTMW 
##                       1                       1
tstm_wind = grepl("thunderstorm wind", data$EVTYPE, ignore.case=TRUE)|grepl("tstm wind", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[tstm_wind] = "THUNDERSTORM WIND"
valid_evtypes = c(valid_evtypes,"THUNDERSTORM WIND")
rm(tstm_wind)

TORNADO

table(data$EVTYPE[grep("tornado", data$EVTYPE, ignore.case=TRUE)])
## 
##             TORNADO          TORNADO F0          TORNADO F1 
##               60652                  19                   4 
##          TORNADO F2          TORNADO F3  TORNADO/WATERSPOUT 
##                   3                   2                   1 
##           TORNADOES            TORNADOS  WATERSPOUT-TORNADO 
##                   2                   1                   2 
##  WATERSPOUT TORNADO WATERSPOUT/ TORNADO  WATERSPOUT/TORNADO 
##                   1                   2                   8
tornado = grepl("tornado", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[tornado] = "TORNADO"
valid_evtypes = c(valid_evtypes,"TORNADO")
rm(tornado)

TROPICAL DEPRESSION

table(data$EVTYPE[grep("tropical depression", data$EVTYPE, ignore.case=TRUE)])
## 
## TROPICAL DEPRESSION 
##                  60
valid_evtypes = c(valid_evtypes,"TROPICAL DEPRESSION")

TROPICAL STORM

table(data$EVTYPE[grep("tropical storm", data$EVTYPE, ignore.case=TRUE)])
## 
##         TROPICAL STORM TROPICAL STORM ALBERTO    TROPICAL STORM DEAN 
##                    690                      1                      2 
##  TROPICAL STORM GORDON   TROPICAL STORM JERRY 
##                      1                      3
trop = grepl("tropical storm", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[trop] = "TROPICAL STORM"
valid_evtypes = c(valid_evtypes,"TROPICAL STORM")
rm(trop)

TSUNAMI

table(data$EVTYPE[grep("tsunami", data$EVTYPE, ignore.case=TRUE)])
## 
## TSUNAMI 
##      20
valid_evtypes = c(valid_evtypes,"TSUNAMI")

VOLCANIC ASH

table(data$EVTYPE[grep("volcanic ash", data$EVTYPE, ignore.case=TRUE)])
## 
##       Volcanic Ash       VOLCANIC ASH Volcanic Ash Plume 
##                  1                 22                  1 
##   VOLCANIC ASHFALL 
##                  3
volcanic = grepl("volcanic ash", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[volcanic] = "VOLCANIC ASH"
valid_evtypes = c(valid_evtypes,"VOLCANIC ASH")
rm(volcanic)

WATERSPOUT

table(data$EVTYPE[grep("waterspout", data$EVTYPE, ignore.case=TRUE)])
## 
##  WATERSPOUT  WATERSPOUT WATERSPOUT- WATERSPOUT/ WATERSPOUTS 
##           1        3796          10           1          37
waterspout = grepl("waterspout", data$EVTYPE, ignore.case=TRUE)|grepl("water spout", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[waterspout] = "WATERSPOUT"
valid_evtypes = c(valid_evtypes,"WATERSPOUT")
rm(waterspout)

WILDFIRE

table(data$EVTYPE[grep("wildfire", data$EVTYPE, ignore.case=TRUE)])
## 
##  WILDFIRE WILDFIRES 
##      2761         8
table(data$EVTYPE[grep("wild fire", data$EVTYPE, ignore.case=TRUE)])
## 
## WILD FIRES 
##          4
wildfire = grepl("wild", data$EVTYPE, ignore.case=TRUE)&grepl("fire", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[wildfire] = "WILDFIRE"
valid_evtypes = c(valid_evtypes,"WILDFIRE")
rm(wildfire)

WINTER STORM

table(data$EVTYPE[grep("winter storm", data$EVTYPE, ignore.case=TRUE)])
## 
##  WINTER STORM WINTER STORMS 
##         11433             3
winter_storm = grepl("winter storm", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[winter_storm] = "WINTER STORM"
valid_evtypes = c(valid_evtypes,"WINTER STORM")
rm(winter_storm)

WINTER WEATHER

table(data$EVTYPE[grep("winter weather", data$EVTYPE, ignore.case=TRUE)])
## 
##     Winter Weather     WINTER WEATHER WINTER WEATHER MIX 
##                 19               7026                  6 
## WINTER WEATHER/MIX 
##               1104
winter_weather = grepl("winter weather", data$EVTYPE, ignore.case=TRUE)
data$EVTYPE[winter_weather] = "WINTER WEATHER"
valid_evtypes = c(valid_evtypes,"WINTER WEATHER")
rm(winter_weather)

And now finally the data is clean… ALMOST clean

nrow(table(data$EVTYPE))
## [1] 413
data2 = subset(data, EVTYPE %in% valid_evtypes)
table(data2$EVTYPE)
## 
## ASTRONOMICAL LOW TIDE             AVALANCHE              BLIZZARD 
##                   174                   388                  2743 
##         COASTAL FLOOD       COLD/WIND CHILL           DEBRIS FLOW 
##                   852                   759                     1 
##             DENSE FOG           DENSE SMOKE               DROUGHT 
##                  1880                    21                  2512 
##            DUST DEVIL            DUST STORM        EXCESSIVE HEAT 
##                   150                   439                  1786 
##           FLASH FLOOD                 FLOOD          FREEZING FOG 
##                 55676                 26179                     2 
##          FROST/FREEZE          FUNNEL CLOUD                  HAIL 
##                  1505                  6938                289957 
##                  HEAT            HEAVY RAIN            HEAVY SNOW 
##                   845                 11833                 15817 
##             HIGH SURF             HIGH WIND             HURRICANE 
##                   969                 21918                   298 
##             ICE STORM      LAKE-EFFECT SNOW       LAKESHORE FLOOD 
##                  2027                   659                    24 
##             LIGHTNING           MARINE HAIL    MARINE STRONG WIND 
##                 15761                   442                    48 
##           RIP CURRENT                SEICHE                 SLEET 
##                   777                    21                   111 
##      STORM SURGE/TIDE           STRONG WIND     THUNDERSTORM WIND 
##                   409                  3776                335549 
##               TORNADO   TROPICAL DEPRESSION        TROPICAL STORM 
##                 60697                    60                   697 
##               TSUNAMI          VOLCANIC ASH            WATERSPOUT 
##                    20                    27                  3846 
##              WILDFIRE          WINTER STORM        WINTER WEATHER 
##                  4231                 11436                  8155

We did lose some observations in the end, but as we have seen they didn’t belong to any category the report intended to analyse.
## Cleaning damage variables

table(data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465858      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424664      7  11330
table(data$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618336      7     19      1      9     21 281832      1   1994

So we see that the data is messed up again (if the previous 400+ lines of code weren’t enough…) and we need to do some cleaning again. In the original documentation (page) it’s said that “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”. So looks like we need to convert our property and crop damages appropriately.

Filter out all unreasonable data:

allowed = c("k", "K", "b", "B", "m", "M")
prop_filter = data$PROPDMGEXP %in% allowed | data$PROPDMGEXP==""
crop_filter = data$CROPDMGEXP %in% allowed | data$CROPDMGEXP==""
filter = prop_filter & crop_filter
data = data[filter, ]
data[data$PROPDMGEXP== "K" | data$PROPDMGEXP=="k",]$PROPDMG = data[data$PROPDMGEXP== "K" | data$PROPDMGEXP=="k",]$PROPDMG*1000
data[data$PROPDMGEXP== "M" | data$PROPDMGEXP=="m",]$PROPDMG = data[data$PROPDMGEXP== "M" | data$PROPDMGEXP=="m",]$PROPDMG*1000000
data[data$PROPDMGEXP== "B" | data$PROPDMGEXP=="b",]$PROPDMG = data[data$PROPDMGEXP== "B" | data$PROPDMGEXP=="b",]$PROPDMG*1000000000

And then we will do the same for the crop damage:

data[data$CROPDMGEXP== "K" | data$CROPDMGEXP=="k",]$CROPDMG = data[data$CROPDMGEXP== "K" | data$CROPDMGEXP=="k",]$CROPDMG*1000
data[data$CROPDMGEXP== "M" | data$CROPDMGEXP=="m",]$CROPDMG = data[data$CROPDMGEXP== "M" | data$CROPDMGEXP=="m",]$CROPDMG*1000000
data[data$CROPDMGEXP== "B" | data$CROPDMGEXP=="b",]$CROPDMG = data[data$CROPDMGEXP== "B" | data$CROPDMGEXP=="b",]$CROPDMG*1000000000

RESULTS

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

So now that we have ALMOST clean data. For this part we will add together fatalities and injuries.

health_data = data[c("EVTYPE", "FATALITIES", "INJURIES")]
health_data$TOTAL = health_data$INJURIES + health_data$FATALITIES

#fatalities = aggregate(FATALITIES ~ EVTYPE,data = health_data, sum)
#injuries = aggregate(INJURIES ~ EVTYPE,data = health_data, sum)
total = aggregate(TOTAL ~ EVTYPE,data = health_data, sum)

top10_total = tail(total[order(total$TOTAL),],n=10)
#top10_fatalities = fatalities[order(fatalities$FATALITIES, decreasing=TRUE),][1:10,]
#top10_injuries = injuries[order(injuries$INJURIES, decreasing=TRUE),][1:10,]

top10_total[order(top10_total$TOTAL, decreasing=TRUE),]
##                EVTYPE TOTAL
## 342           TORNADO 97015
## 327 THUNDERSTORM WIND 10119
## 69     EXCESSIVE HEAT  8748
## 89              FLOOD  7279
## 194         LIGHTNING  6048
## 124              HEAT  3593
## 87        FLASH FLOOD  2837
## 168         ICE STORM  2079
## 149         HIGH WIND  1788
## 396          WILDFIRE  1696

So the data looks reasonable, let’s actually plot it:

par(mfrow = c(1,1), las = 2, mar=c(5,10,4,2))
barplot(top10_total$TOTAL, width=20,         
        main = expression("Total health damage (injuries + fatalities)"),
        col = "thistle3",
        horiz=TRUE, 
        names.arg = top10_total$EVTYPE, cex.names = 0.8, cex.axis=0.8)       

Across the United States, which types of events have the greatest economic consequences?

Since we have our data ready we will do similar analysis as in the previous part

damage_data = data[c("EVTYPE", "PROPDMG", "CROPDMG")]
damage_data$TOTAL = damage_data$PROPDMG + damage_data$CROPDMG

total = aggregate(TOTAL ~ EVTYPE,data = damage_data, sum)

top10_total = tail(total[order(total$TOTAL),],n=10)

top10_total[order(top10_total$TOTAL, decreasing=TRUE),]
##                EVTYPE        TOTAL
## 89              FLOOD 161029482607
## 155         HURRICANE  90762527810
## 342           TORNADO  57357880993
## 314  STORM SURGE/TIDE  47965579000
## 122              HAIL  20709194654
## 87        FLASH FLOOD  18438634977
## 45            DROUGHT  15018927780
## 327 THUNDERSTORM WIND  10909646291
## 168         ICE STORM   8968137810
## 396          WILDFIRE   8894345130

So the data looks reasonable, let’s actually plot it:

par(mfrow = c(1,1), las = 2, mar=c(5,10,4,2))
barplot(top10_total$TOTAL, width=20,         
        main = expression("Total economical damage"),
        col = "thistle3",        
        horiz=TRUE, 
        names.arg = top10_total$EVTYPE, cex.names = 0.8, cex.axis=0.8)