This report takes NOAAs meteorological events list and attempt to draw health and economic generalities from it. I will take all data, include only those that have fatalities, injuries or economic damage via property or crops and then, for comparison, only include data from 1994 onward because it appears that is when more category types were included. Cleaning of the event names is essential to the categorical analysis and will entail taking NOAAs official name list and coercing most of the data points into those categories and pruning high value categories to a more general one. Once the events are more generalised, we’ll compare the top events by year to see trends in the most destructive types then rank them individually for fatalities, injures then economic value. I’ll then cross reference the top 20 of those lists to see what shows up between both fatalities and injuries as well as what events cause the worst effects in all three realms.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringdist)
## Warning: package 'stringdist' was built under R version 3.2.3
library(lubridate)
Read in the Data
#Storm Data Downloaded Sunday, Jan 24 at 08:24 AM PST
StormDataFull <- tbl_df(read.csv(bzfile('repdata-data-StormData.csv.bz2')))
Start Processing
Getting Familiar with the Data
StormData <- select(StormDataFull, Begin = BGN_DATE, Event = EVTYPE, FATALITIES:CROPDMGEXP)
EventTable <- tbl_df(as.data.frame(with(StormData, table(Event)))) %>% arrange(desc(Freq))
Looks like Hail, TSTM Wind, Thunderstorm Wind, Tornado, Flash Floods, Floods, Thunderstorm Winds, High Wind, Lightning and Heavy Snow are our top ten number of entries. Also, Thunderstorm Wind, TSTM Wind and Thunderstorm Winds should be the same category. There may be other categories that are similar to each other so let’s take a look:
#conducted name similarity to see how the names were grouping
#NameListLong <- EventTable$Event
#agrep(NameListLong[1], NameListLong, ignore.case = TRUE, value = TRUE, useBytes = FALSE)
#agrep(NameListLong[2], NameListLong, ignore.case = TRUE, value = TRUE, useBytes = FALSE)
#agrep(NameListLong[3], NameListLong, ignore.case = TRUE, value = TRUE, useBytes = FALSE)
#grep(NameListLong[4], NameListLong, ignore.case = TRUE, value = TRUE, useBytes = FALSE)
#agrep(NameListLong[5], NameListLong, ignore.case = TRUE, value = TRUE, useBytes = FALSE)
#commented out for brevity of report
This was only the first top five categories and looks like a lot of the categories are duplicated as well and events can be in multiple categories with some modifiers. Later, I’d like to generalize these so that smaller categories get lumped with the main ones. But first, let’s look at economic and health effects at this stage.
Modify Damage amounts based on exponent
Exponents <- unique(c(levels(StormData$PROPDMGEXP), levels(StormData$CROPDMGEXP)))
Expvalues <- as.numeric(c(1, 1, 1, 1, 1, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7, 10^9, 10^9, 10^2, 10^2, 10^3, 10^6, 10^6, 10^3))
#create a dictionary of values and mapped values
Modifier <- data.frame(DMGEXP = Exponents, Values = Expvalues, stringsAsFactors = TRUE)
#Use the dictionary to change the original exp values for property and crops to numerical values
StormData <- left_join(StormData, Modifier, by = c("PROPDMGEXP"="DMGEXP")) %>% rename(PROPEXP = Values)
## Warning in left_join_impl(x, y, by$x, by$y): joining factors with different
## levels, coercing to character vector
StormData <- left_join(StormData, Modifier, by = c("CROPDMGEXP"="DMGEXP")) %>% rename(CROPEXP = Values)
## Warning in left_join_impl(x, y, by$x, by$y): joining factors with different
## levels, coercing to character vector
#Use the values provided and exp values create total values for property and crop
#Then add those two together to get total economic impact
StormData <- mutate(StormData, PROPVALUE = PROPDMG * PROPEXP) %>%
mutate(CROPVALUE = CROPDMG * CROPEXP) %>%
mutate(TOTALVALUE = PROPVALUE + CROPVALUE) %>%
select(Begin, Event, Fatalities = FATALITIES, Injuries = INJURIES, TotalValue = TOTALVALUE) %>%
filter(Fatalities > 0 | Injuries > 0 | TotalValue > 0)
Now to do a little pre-processing before we start lumping categories together, let’s entirely eliminate the ones that simply don’t matter to the statistics. For this, we’ll group by event type and only keep those in which there was a fatality, injury, or some form of property damage.
#Translate all Events to uppercase as well to prevent duplicates by case
StormData$Event <- toupper(StormData$Event)
StormData$Event <- gsub('TSTMW', 'THUNDERSTORM WIND', x = StormData$Event)
StormData$Event <- as.factor(gsub('TSTM', 'THUNDERSTORM', x = StormData$Event))
NameList <- levels(as.factor(as.character(StormData$Event))) #remove uneeded factor levels
Right off the bat we’ve gone from 985 categories to 441 categories. Now this is still ten times larger than the official list given in the NOAA data documentation but it’s a start.
ValidNames <- as.factor(c('ASTRONOMICAL LOW TIDE', 'AVALANCHE', 'BLIZZARD', 'COASTAL FLOOD', 'COLD/WIND CHILL', 'DEBRIS FLOW', 'DENSE FOG', 'DENSE SMOKE', 'DROUGHT', 'DUST DEVIL', 'DUST STORM', 'EXCESSIVE HEAT', 'EXTREME COLD/WIND CHILL', 'FLASH FLOOD', 'FLOOD', 'FREEZING FOG', 'FROST/FREEZE', 'FUNNEL CLOUD', 'HAIL', 'HEAT', 'HEAVY RAIN', 'HEAVY SNOW', 'HIGH SURF', 'HIGH WIND', 'HURRICANE/TYPHOON', 'ICE STORM', 'LAKESHORE FLOOD', 'LAKE-EFFECT SNOW', 'LIGHTNING', 'MARINE HAIL', 'MARINE HIGH WIND', 'MARINE STRONG WIND', 'MARINE THUNDERSTORM WIND', 'RIP CURRENT', 'SEICHE', 'SLEET', 'STORM TIDE', 'STRONG WIND', 'THUNDERSTORM WIND', 'TORNADO', 'TROPICAL DEPRESSION', 'TROPICAL STORM', 'TSUNAMI', 'VOLCANIC ASH', 'WATERSPOUT', 'WILDFIRE', 'WINDSTORM', 'WINTER STORM', 'WINTER WEATHER'))
DistanceMatrix <- adist(NameList, ValidNames)
mins<-apply(DistanceMatrix,1,function(x)return(array(which.min(x))))
minvalues <- apply(DistanceMatrix, 1, function(x)return(array(min(x))))
#Create a dictionary with old Event names and start by mapping to a Valid Name that's only a character or two different
#Except for HAIL because it would turn into RAIN
Mapped <- data.frame(Original = NameList, New = ValidNames[mins], Distance = minvalues)
Mapped <- mutate(Mapped, FinalEvent = ifelse( Mapped$Distance == 1 | (Mapped$Distance==2 & Mapped$Original!='HAIL'), as.character(Mapped$New), as.character(Mapped$Original )))
#Now, based on data exploration, start to substitute larger category names for smaller ones
Mapped$FinalEvent <- gsub('SML', 'SMALL', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('FLD', 'FLOOD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('CSTL', 'COASTAL', x = Mapped$FinalEvent)
#Change abbreviations
Mapped$FinalEvent <- gsub(' +', ' ', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('/', ' ', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub(' AND ', ' ', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub(' & ', ' ', x = Mapped$FinalEvent)
#Change tiny characters
Mapped$FinalEvent <- gsub('^ +', '', x = Mapped$FinalEvent)
#get rid of any spaces at the start of the event name
Mapped$FinalEvent <- gsub('^THUNDERSTORM WIND.*', 'THUNDERSTORM WIND', x = Mapped$FinalEvent)
#Anything starting with thunderstorm wind will be only that
Mapped$FinalEvent <- gsub('^TORNADO.*', 'TORNADO', x = Mapped$FinalEvent)
#Same for Tornado
Mapped$FinalEvent <- gsub('.*WATERSPOUT.*', 'WATERSPOUT', x = Mapped$FinalEvent)
#anything with waterspout is just waterspout
Mapped$FinalEvent <- gsub('FLOODING', 'FLOOD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*FLASH FLOOD.*', 'FLASH FLOOD', x = Mapped$FinalEvent)
#Flood and flooding are synonomous
Mapped$FinalEvent <- gsub('^HURRICANE.*', 'HURRICANE/TYPHOON', x = Mapped$FinalEvent)
#Hurricanes all together
Mapped$FinalEvent <- gsub('.*TYPHOON.*', 'HURRICANE/TYPHOON', x = Mapped$FinalEvent)
#A typhoon is just another name for a Hurricane and not officially recognized in the list of valid names
Mapped$FinalEvent <- gsub('.*TROPICAL STORM.*', 'TROPICAL STORM', x = Mapped$FinalEvent)
#Group all tropical storms together
Mapped$FinalEvent <- gsub('.*BLIZZARD.*', 'BLIZZARD', x = Mapped$FinalEvent)
#Chose to group anything with Blizzard as BLIZZARD although there is one category with Heavy Snow/Blizzard/Avalanche - it appears only once in 1994
#Avalanche is not a very frequent event and this avalanche was probably caused by the blizzard anyways.
Mapped$FinalEvent <- gsub('.*WILD.*', 'WILDFIRE', x = Mapped$FinalEvent)
#Wildfires in all forms
Mapped$FinalEvent <- gsub('.*SURGE.*', 'STORM TIDE', x = Mapped$FinalEvent)
#storm surge/tide in their forms are officially called with storm tide
Mapped$FinalEvent <- gsub('.*MARINE HAIL.*', 'abcdefg', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*HAIL.*', 'HAIL', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*abcdefg.*', 'MARINE HAIL', x = Mapped$FinalEvent)
#anything with Hail other than Marine Hail is Hail
Mapped$FinalEvent <- gsub('.*ICE STORM.*', 'ICE STORM', x = Mapped$FinalEvent)
#Anything else with ICE STORM is infrequent so lump them together
Mapped$FinalEvent <- gsub('.*ROAD.*', 'ICE STORM', x = Mapped$FinalEvent)
#Anything with Road has to do with icy road so lump with ice storm
Mapped$FinalEvent <- gsub('^GUSTY.*', 'HIGH WIND', x = Mapped$FinalEvent)
#Gusty wind to high winds
Mapped$FinalEvent <- gsub('.*SLIDE.*', 'DEBRIS FLOW', x = Mapped$FinalEvent)
#Isolate landslides and mudslides
Mapped$FinalEvent <- gsub('.*URBAN.*', 'FLOOD', x = Mapped$FinalEvent)
#Change all the urban flooding to FLOOD
Mapped$FinalEvent <- gsub('.*FLASH.*', 'FLASH FLOOD', x = Mapped$FinalEvent)
#All remaining categories with flash in them are flash flood
Mapped$FinalEvent <- gsub('.*COASTAL FLOOD.*', 'COASTAL FLOOD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*TIDAL FLOOD.*', 'COASTAL FLOOD', x = Mapped$FinalEvent)
#Anything with coastal or tidal flood get mapped to coastal flood
Mapped$FinalEvent <- gsub('.*LAKE FLOOD.*', 'LAKESHORE FLOOD', x = Mapped$FinalEvent)
#Rename lake flood with lakeshore flood from official names
Mapped$FinalEvent <- gsub('.*COASTAL FLOOD.*', 'XX1', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*FLASH FLOOD.*', 'XX2', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*LAKESHORE FLOOD.*', 'XX3', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('.*FLOOD.*', 'FLOOD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('XX1', 'COASTAL FLOOD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('XX2', 'FLASH FLOOD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('XX3', 'LAKESHORE FLOOD', x = Mapped$FinalEvent)
#Any weirdly named flooding that isn't a known name will get grouped with generic flooding
Mapped$FinalEvent <- gsub('.*LIGHTNING.*', 'LIGHTNING', x = Mapped$FinalEvent)
#All lightning is lightning
Mapped$FinalEvent <- gsub('.*HIGH WINDS.*', 'HIGH WINDS', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('^HIGH WIND.*', 'HIGH WIND', x = Mapped$FinalEvent)
#Any high winds are just high winds without taking out marine high wind
Mapped$FinalEvent <- gsub('.*HEAVY RAIN.*', 'HEAVY RAIN', x = Mapped$FinalEvent)
#All remaining Heavy Rain are Heavy rain
Mapped$FinalEvent <- gsub('.+HEAT$', 'EXCESSIVE HEAT', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('^HEAT.*', 'HEAT', x = Mapped$FinalEvent)
#Anything that ended with heat was record or extreme heat, if it started with heat, it was a heat wave
Mapped$FinalEvent <- gsub('.*SURF.*', 'HIGH SURF', x = Mapped$FinalEvent)
#Group anything with surf into high surf
Mapped$FinalEvent <- gsub('.*EXTREME COLD.*', 'EXTREME COLD', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub(' WIND CHILL$', '', x = Mapped$FinalEvent)
Mapped$FinalEvent <- gsub('^EXTREME$', 'EXTREME COLD', x = Mapped$FinalEvent)
#Gets rid of all Extreme cold and wind chill
MappedDict <- select(Mapped, Original, FinalEvent)
MappedDict$FinalEvent <- as.factor(MappedDict$FinalEvent)
NameList <- levels(as.factor(MappedDict$FinalEvent))
#select just the final values for analysis and rename finalevent as event
StormData <- left_join(StormData, MappedDict, by = c("Event"="Original")) %>%
select(Begin, Event = FinalEvent, Fatalities:TotalValue)
StormData$Event <- as.factor(StormData$Event)
Now we need to extract the year each event occurred
StormData$Begin <- parse_date_time(StormData$Begin, "%m%d%Y %H%M%S")
StormData <- mutate(StormData, Year = as.numeric(format(Begin,'%Y')))
SummaryStatistics <- group_by(StormData, Year, Event) %>%
summarise(Freq = n(), Fatalities = sum(Fatalities), Injuries = sum(Injuries), TotalBillions = sum(TotalValue)/1000000000) %>%
filter(Fatalities >0 | Injuries >0 | TotalBillions > 0) %>%
group_by(Year) %>%
mutate( ValueRank = rank(-TotalBillions)) %>%
mutate( FatalRank = rank(-Fatalities)) %>%
mutate( InjuryRank = rank(-Injuries)) %>%
arrange(Year, desc(TotalBillions), desc(Fatalities), desc(Injuries))
#year that more events are in so better data to compare each to
YearStart <- 1993
From splitting the data, you start to lose granularity if more than 15 or 20 event types are included in the analysis, I will be looking at the top 20 events by Fatalities, Injuries and Total Cost in damage.
RankNumber <- 6
#Keep only those years where the year is comparable to each other
SumStatsValid <- SummaryStatistics[SummaryStatistics$Year > YearStart,]
#Rank by fatalities
RankNumber <- 6
SumStatsValid <- arrange(SumStatsValid, Year, desc(Fatalities), desc(Injuries), desc(TotalBillions))
ggplot(SumStatsValid[SumStatsValid$FatalRank<RankNumber, ], aes(Year, Fatalities, fill = Event)) + geom_bar(stat="identity", position = "stack")
Here we can see the devastating effect of heat wave in chicago of 1995 Heat and tornados are some of the highest events that consistently cause devastation We can also see along the bottom how Excessive Heat, Heat and Tornados, rip currents and lightning all group towards teh bottom and are the worst.
#Rank by Injuries
RankNumber <- 6
SumStatsValid <- arrange(SumStatsValid, Year, desc(Injuries), desc(TotalBillions), desc(Fatalities))
ggplot(SumStatsValid[SumStatsValid$InjuryRank<RankNumber, ], aes(Year, Injuries, fill = Event)) + geom_bar(stat="identity", position = "stack")
We can see the biggest effects of Tornados in the injury ranking they’re also evident in the fatal Also of note are the number of injuries in 1998 from Hurricane Madeline in Oct of 1998 causing widespread flooding and injuries
RankNumber <-6
SumStatsValid <- arrange(SumStatsValid, Year, desc(TotalBillions), desc(Fatalities), desc(Injuries))
ggplot(SumStatsValid[SumStatsValid$ValueRank<RankNumber,], aes(Year, TotalBillions, fill = Event))+geom_bar(stat="identity", position = "stack")
Here the huge effect of Hurricane Katrina and its followup flooding causing record breaking damage. Hurricanes are commonly the biggest cause of damage along with tornadoes and flooding.
TotalEffect <- SumStatsValid %>%
group_by(Event) %>%
summarise(Freq = n(), Fatal = sum(Fatalities), Injured = sum(Injuries), Billions = sum(TotalBillions)) %>%
arrange(desc(Fatal), desc(Injured), desc(Billions)) %>%
mutate( FatalRank = rank(-Fatal)) %>%
mutate( InjuryRank = rank(-Injured)) %>%
mutate( ValueRank = rank(-Billions))
I don’t think there’s a good way to weight injuries vs. fatalities, no number of injuries could ever compare to a death so I will rank them seperately.
Events in the top 20 by Fatalities:
TotalEffect[TotalEffect$FatalRank<21,] %>% arrange(desc(Fatal)) %>% select(Event, Fatalities = Fatal)
## Source: local data frame [20 x 2]
##
## Event Fatalities
## (fctr) (dbl)
## 1 EXCESSIVE HEAT 2020
## 2 TORNADO 1593
## 3 HEAT 1111
## 4 FLASH FLOOD 1007
## 5 LIGHTNING 795
## 6 RIP CURRENT 572
## 7 FLOOD 487
## 8 THUNDERSTORM WIND 434
## 9 EXTREME COLD 277
## 10 HIGH WIND 263
## 11 AVALANCHE 224
## 12 WINTER STORM 195
## 13 HIGH SURF 165
## 14 HURRICANE/TYPHOON 135
## 15 COLD 127
## 16 HEAVY SNOW 123
## 17 STRONG WIND 110
## 18 HEAVY RAIN 98
## 19 ICE STORM 92
## 20 WILDFIRE 87
Events in the top 20 by Injuries
TotalEffect[TotalEffect$InjuryRank<21,]%>% arrange(desc(Injured)) %>% select(Event, Injured)
## Source: local data frame [20 x 2]
##
## Event Injured
## (fctr) (dbl)
## 1 TORNADO 22589
## 2 FLOOD 6861
## 3 EXCESSIVE HEAT 6730
## 4 THUNDERSTORM WIND 5949
## 5 LIGHTNING 5118
## 6 HEAT 2489
## 7 ICE STORM 2018
## 8 FLASH FLOOD 1772
## 9 WILDFIRE 1456
## 10 HURRICANE/TYPHOON 1332
## 11 HIGH WIND 1320
## 12 WINTER STORM 1298
## 13 HEAVY SNOW 980
## 14 HAIL 953
## 15 FOG 734
## 16 RIP CURRENT 529
## 17 DUST STORM 439
## 18 WINTER WEATHER 398
## 19 BLIZZARD 390
## 20 TROPICAL STORM 383
Events in the top 20 of both
TotalEffect[TotalEffect$FatalRank<21 & TotalEffect$InjuryRank<21,] %>% select(Event, Fatalities = Fatal, Injured)
## Source: local data frame [14 x 3]
##
## Event Fatalities Injured
## (fctr) (dbl) (dbl)
## 1 EXCESSIVE HEAT 2020 6730
## 2 TORNADO 1593 22589
## 3 HEAT 1111 2489
## 4 FLASH FLOOD 1007 1772
## 5 LIGHTNING 795 5118
## 6 RIP CURRENT 572 529
## 7 FLOOD 487 6861
## 8 THUNDERSTORM WIND 434 5949
## 9 HIGH WIND 263 1320
## 10 WINTER STORM 195 1298
## 11 HURRICANE/TYPHOON 135 1332
## 12 HEAVY SNOW 123 980
## 13 ICE STORM 92 2018
## 14 WILDFIRE 87 1456
All these events need to be treated with an appropriate amount of respect because they cause an incredibly amount of hurt. Events in the top 20 by Damage
TotalEffect[TotalEffect$ValueRank<21,] %>% arrange(desc(Billions)) %>% select(Event, Billions)
## Source: local data frame [20 x 2]
##
## Event Billions
## (fctr) (dbl)
## 1 FLOOD 150.2574504
## 2 HURRICANE/TYPHOON 90.8165278
## 3 STORM TIDE 47.8355790
## 4 TORNADO 25.9971928
## 5 FLASH FLOOD 18.6725961
## 6 HAIL 18.5850748
## 7 DROUGHT 14.9681720
## 8 THUNDERSTORM WIND 10.7469470
## 9 ICE STORM 8.8553446
## 10 TROPICAL STORM 8.4082766
## 11 WILDFIRE 8.2753451
## 12 HIGH WIND 6.1467907
## 13 HEAVY RAIN 4.0041309
## 14 WINTER STORM 1.5720413
## 15 EXTREME COLD 1.3889084
## 16 SEVERE THUNDERSTORM 1.2055600
## 17 FROST FREEZE 1.1046660
## 18 LIGHTNING 0.8955079
## 19 HEAVY SNOW 0.8675707
## 20 BLIZZARD 0.5392189
Flooding, Hurricanes and Tornadoes cause a huge amount of damage.
Evemts that are in all 3 top 20:
TotalEffect[TotalEffect$FatalRank<21 & TotalEffect$InjuryRank<21 & TotalEffect$ValueRank<21,] %>% select(Event, Fatalities = Fatal, Injured, Billions)
## Source: local data frame [11 x 4]
##
## Event Fatalities Injured Billions
## (fctr) (dbl) (dbl) (dbl)
## 1 TORNADO 1593 22589 25.9971928
## 2 FLASH FLOOD 1007 1772 18.6725961
## 3 LIGHTNING 795 5118 0.8955079
## 4 FLOOD 487 6861 150.2574504
## 5 THUNDERSTORM WIND 434 5949 10.7469470
## 6 HIGH WIND 263 1320 6.1467907
## 7 WINTER STORM 195 1298 1.5720413
## 8 HURRICANE/TYPHOON 135 1332 90.8165278
## 9 HEAVY SNOW 123 980 0.8675707
## 10 ICE STORM 92 2018 8.8553446
## 11 WILDFIRE 87 1456 8.2753451
Tornados, (Flash) Floods, Lightning, Thunderstorms and their winds, Winter storms in all varieties and Hurricanes are overall the worst events that can happen right now.