This analysis reviews data collected from NOAA over the past 50 years to determine what weather events are most harmful to the population, and what weather events have the greatest economic consequences.
With respect to the health of the population, tornadoes are clearly the most dangerous weather event for both injuries and fatalities: almost three times more deadly than excessive heat, and almost nine times more injuries than thunderstorm wind. The specific arrangement of events differ for these two factors after tornadoes, but thunderstorm wind, heat, excessive heat, flood, flash flood, lightning, winter storm and high wind are among the top 10 for both injuries and fatalities.
The greatest economic consequence is clearly due to floods, which costs around three times more than the second place hurricane/typhoons. These two categories are followed by Tornadoes, Hail, and flash floods.
Packages to be included in this analysis are listed below.
require("dplyr")
require("ggplot2")
require("gridExtra")
The raw data was downloaded from the following.
if(!file.exists("./data/NOAA.csv.bz2")) {
dir.create("data")
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "./data/NOAA.csv.bz2")
}
The raw data is read into R, and the structure can be seen below for reference.
NOAA <- as_tibble(read.csv("./data/NOAA.csv.bz2", stringsAsFactor = FALSE))
str(NOAA)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We will look at events that caused physical harm to the population, namely: injuries and fatalities.
health <- NOAA %>%
mutate(EVTYPE = toupper(EVTYPE)) %>%
filter(FATALITIES > 0 | INJURIES > 0) %>%
group_by(EVTYPE) %>%
summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
arrange(desc(Injuries),desc(Fatalities))
length(unique(health$EVTYPE))
## [1] 205
After processing, there are 205 levels of the event type, as you can see. However, according to the National Weather Service Storm Data Documentation, there are only 48 categories of the event types. You can clearly see that there are multiple items, some misspellings, that should all be grouped together.
grep("THUNDERSTORM",health$EVTYPE, value = TRUE)
## [1] "THUNDERSTORM WIND" "THUNDERSTORM WINDS"
## [3] "THUNDERSTORMW" "MARINE THUNDERSTORM WIND"
## [5] "THUNDERSTORM" "THUNDERSTORM WINDS"
## [7] "THUNDERSTORM WINDSS" "LIGHTNING AND THUNDERSTORM WIN"
## [9] "THUNDERSTORM WINDS 13" "THUNDERSTORM WINDS/HAIL"
## [11] "THUNDERSTORMS WINDS" "THUNDERSTORM WIND (G40)"
## [13] "THUNDERSTORM WIND G52"
I subsetted the “health” dataset first, so that the list of EVTYPES was a bit smaller. The full NOAA dataset has 985 unique types. It does not make sense to go through all of those lines manually, if we only need to parse 205 types.
This is a bit of a labor intensive, and a long list of code, but below is the code to clean up the majority of these categories. There are some that have not been cleaned up, simply due to lack of information. (e.g. “DROWNING” is a cause of death, and does not specify the event, such as “FLOOD” or “FLASH FLOOD”, etc.)
health <- NOAA %>%
mutate(EVTYPE = toupper(EVTYPE)) %>%
filter(FATALITIES > 0 | INJURIES > 0)
clean_events <- function(health) {
health[grep("AVALANCE",health$EVTYPE),"EVTYPE"] <- "AVALANCHE"
health[grep("FIRE",health$EVTYPE),"EVTYPE"] <- "WILDFIRE"
health[grep("BLACK ICE|BLOWING SNOW|FALLING SNOW/ICE|ICE ROAD|ICY|ICE ON|LIGHT SNOW|^SNOW|^WINTER WEATHER|WINTRY|^ICE$|ICE AND|ICE F|LATE SEASON SNOW|WINTER WEATHER",health$EVTYPE),"EVTYPE"] <- "WINTER WEATHER"
health[grep("COASTAL|CSTL|EROSION",health$EVTYPE),"EVTYPE"] <- "COASTAL FLOOD"
health[grep("^COLD|LOW TEMP|UNSEASONABL*. COLD|^COOL",health$EVTYPE),"EVTYPE"] <- "COLD/WIND CHILL"
health[grep("DROUGT/EXCESSIVE HEAT",health$EVTYPE),"EVTYPE"] <- "DROUGHT"
health[grep("MICROBURST|GUSTY|HIGH WIND|WIND DAMAGE|WINDS|^WIND$|GRADIENT WIND|^WIND",health$EVTYPE),"EVTYPE"] <- "HIGH WIND"
health[grep("^STRONG WIND|DOWNBURST|TURBULENCE",health$EVTYPE),"EVTYPE"] <- "STRONG WIND"
health[grep("EXCESSIVE RAINFALL|^HEAVY RAIN|TORRENTIAL|WETNESS|HEAVY PRECIP|SHOWER|RAIN",health$EVTYPE),"EVTYPE"] <- "HEAVY RAIN"
health[grep("EXTENDED COLD|EXTREME WIND*.CHILL|RECORD COLD|EXTREME COLD",health$EVTYPE),"EVTYPE"] <- "EXTREME COLD/WIND CHILL"
health[grep("EXTREME HEAT|RECORD HEAT|EXCESSIVE HEAT",health$EVTYPE),"EVTYPE"] <- "EXCESSIVE HEAT"
health[grep("FLASH FLOOD|RAPIDLY RISING WATER|DAM BREAK",health$EVTYPE),"EVTYPE"] <- "FLASH FLOOD"
health[grep("^FLOOD|RIVER*.FLOOD|^UR|^BREAKUP|MAJOR FLOOD|MINOR FLOOD|RURAL FLOOD|STREAM FLOOD|ICE JAM",health$EVTYPE),"EVTYPE"] <- "FLOOD"
health[grep("FOG AND COLD",health$EVTYPE),"EVTYPE"] <- "FREEZING FOG"
health[grep("FOG",health$EVTYPE),"EVTYPE"] <- "DENSE FOG"
health[grep("FREEZE|FROST",health$EVTYPE),"EVTYPE"] <- "FROST/FREEZE"
health[grep("FREEZING|HEAVY MIX|MIXED PRECIP",health$EVTYPE),"EVTYPE"] <- "SLEET"
health[grep("GLAZE",health$EVTYPE),"EVTYPE"] <- "WINTER STORM"
health[grep("SURF|HEAVY SEA|HEAVY SURF|HIGH SEA|HIGH SURF|HIGH WATER|HIGH WAVE|ROGUE WAVE|^ROUGH|SWELL",health$EVTYPE),"EVTYPE"] <- "HIGH SURF"
health[grep("^HEAT|UNSEASONABLY WARM|^WARM",health$EVTYPE),"EVTYPE"] <- "HEAT"
health[grep("^HEAVY SNOW|EXCESSIVE SNOW|RECORD SNOW",health$EVTYPE),"EVTYPE"] <- "HEAVY SNOW"
health[grep("HURRICANE|TYPHOON",health$EVTYPE),"EVTYPE"] <- "HURRICANE/TYPHOON"
health[grep("ICE STORM",health$EVTYPE),"EVTYPE"] <- "ICE STORM"
health[grep("LANDSLIDE|MUDSLIDE|LANDSLUMP|LANDSPOUT|SLIDE",health$EVTYPE),"EVTYPE"] <- "DEBRIS FLOW"
health[grep("LIGHTNING|LIGHTING|LIGNTNING",health$EVTYPE),"EVTYPE"] <- "LIGHTNING"
health[grep("MARINE STRONG",health$EVTYPE),"EVTYPE"] <- "MARINE HIGH WIND"
health[grep("RIP CURRENT",health$EVTYPE),"EVTYPE"] <- "RIP CURRENT"
health[grep("HAIL",health$EVTYPE),"EVTYPE"] <- "HAIL"
health[grep("STORM SURGE|TIDE",health$EVTYPE),"EVTYPE"] <- "STORM TIDE"
health[grep("THUNDERSTORM|TSTM|THUNDERTORM|WIND STORM|THUNDERSTROM WIND|TUNDERSTORM WIND",health$EVTYPE),"EVTYPE"] <- "THUNDERSTORM WIND"
health[grep("THUNDERSNOW|^WINTER STORM|BLIZZARD|ICE STORM",health$EVTYPE),"EVTYPE"] <- "WINTER STORM"
health[grep("TIDAL FLOODING|LAKE*.FLOOD",health$EVTYPE),"EVTYPE"] <- "LAKESHORE FLOOD"
health[grep("TORNADO|WHIRLWIND|GUSTNADO|TORNDAO",health$EVTYPE),"EVTYPE"] <- "TORNADO"
health[grep("TROPICAL STORM",health$EVTYPE),"EVTYPE"] <- "TROPICAL STORM"
health[grep("WATERSP",health$EVTYPE),"EVTYPE"] <- "WATERSPOUT"
health[grep("BLOWING DUST",health$EVTYPE),"EVTYPE"] <- "DUST STORM"
health[grep("LAKE*.SNOW",health$EVTYPE),"EVTYPE"] <- "LAKE-EFFECT SNOW"
health
}
health <- clean_events(health) %>%
group_by(EVTYPE) %>%
summarize(Fatalities = sum(FATALITIES, na.rm=TRUE), Injuries = sum(INJURIES, na.rm=TRUE))
top_injury_data <- health %>%
select(EVTYPE,Injuries) %>%
arrange(desc(Injuries)) %>%
top_n(10)
top_fatality_data <- health %>%
select(EVTYPE,Fatalities) %>%
arrange(desc(Fatalities)) %>%
top_n(10)
length(unique(health$EVTYPE))
## [1] 44
We have parsed this down to 44 categories, which should give us a much better answer on the total injuries and fatalities caused by these events. The final results can be seen in the results section below.
Here we will process the data to look at economic factors, namely: property damage and crop damage. The variables needed here are “PROPDMG” and “CROPDMG” which gives the numeric dollar value in property and crop damage, and “PROPDMGEXP” and “CROPDMGEXP” which give the suffix of the dollars (k for thousand, M for million, B for billion).
However, the EXP variables have some messy data:
unique(NOAA$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(NOAA$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
If we look to see how much of a problem this is, we see these incorrect values only account for 0.04% of the data. These will simply be removed from the analysis.
nrow(NOAA[which(!(NOAA$PROPDMGEXP %in% toupper(c("K","B","M","")))),])
## [1] 328
nrow(NOAA[which(!(NOAA$CROPDMGEXP %in% toupper(c("K","B","M","")))),])
## [1] 49
(nrow(NOAA[which(!(NOAA$PROPDMGEXP %in% toupper(c("K","B","M","")))),]) +
nrow(NOAA[which(!(NOAA$CROPDMGEXP %in% toupper(c("K","B","M","")))),])) /
nrow(NOAA) * 100
## [1] 0.042
Filtering the data to look at any economic impact, we will again take a look at which event types we need to clean up.
economic <- NOAA %>%
mutate(EVTYPE = toupper(EVTYPE)) %>%
mutate(PROPDMGEXP = toupper(PROPDMGEXP)) %>%
mutate(CROPDMGEXP = toupper(CROPDMGEXP)) %>%
filter(CROPDMG > 0 | PROPDMG > 0)
length(unique(economic$EVTYPE))
## [1] 397
Just like with the health data, we have a bunch of event types that need to be cleaned up, with 397 unique event types. We will simply run this through the same clean_events function we created earlier for the health dataset.
economic <- clean_events(economic)
length(unique(economic$EVTYPE))
## [1] 46
Which gives us a much cleaner 46 event types to analyze.
Now, we will need to modify the actual property and crop damage costs. They are both split into two variables: a numeric column, and a text column stating the suffix (thousand, million, billion). This needs to be modified to a single variable for both property and crop damage, i.e. PROPDMG value of 25.5 and PROPDMGEXP value of “M” needs to be combined to a value of “25,500,000”.
economic$PROPDMGEXP <- with(economic,case_when(PROPDMGEXP=="K" ~ 1000, PROPDMGEXP=="M" ~ 1000000, PROPDMGEXP=="B" ~ 1000000000, PROPDMGEXP == NA ~ 1))
economic$CROPDMGEXP <- with(economic,case_when(CROPDMGEXP=="K" ~ 1000, CROPDMGEXP=="M" ~ 1000000, CROPDMGEXP=="B" ~ 1000000000, CROPDMGEXP == NA ~ 1))
top_economic <- economic %>%
mutate(DMG = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP) %>%
group_by(EVTYPE) %>%
summarize(Damage = sum(DMG, na.rm=TRUE)) %>%
arrange(desc(Damage)) %>%
top_n(10)
The data has now been processed, and the results of the data can be seen below.
We can take a look at our health dataset and see the 10 events that caused the most injuries, and the 10 events that caused the most fatalities.
top_injury_data
## # A tibble: 10 x 2
## EVTYPE Injuries
## <chr> <dbl>
## 1 TORNADO 91407
## 2 THUNDERSTORM WIND 8523
## 3 FLOOD 6873
## 4 EXCESSIVE HEAT 6730
## 5 LIGHTNING 5232
## 6 WINTER STORM 4350
## 7 HIGH WIND 2568
## 8 HEAT 2513
## 9 FLASH FLOOD 1802
## 10 WILDFIRE 1608
top_fatality_data
## # A tibble: 10 x 2
## EVTYPE Fatalities
## <chr> <dbl>
## 1 TORNADO 5637
## 2 EXCESSIVE HEAT 2020
## 3 HEAT 1158
## 4 FLASH FLOOD 1036
## 5 LIGHTNING 817
## 6 THUNDERSTORM WIND 660
## 7 RIP CURRENT 572
## 8 FLOOD 511
## 9 WINTER STORM 414
## 10 HIGH WIND 405
injury_plot <- ggplot(top_injury_data,aes(x=reorder(EVTYPE, -Injuries), y=Injuries))+
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x="Event Type", y="Number of Injuries", title="Most Harmful Weather Events Related to Injuries")
fatality_plot <- ggplot(top_fatality_data,aes(x=reorder(EVTYPE, -Fatalities), y=Fatalities))+
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x="Event Type", y="Number of Fatalities", title="Most Harmful Weather Events Related to Fatalities")
grid.arrange(injury_plot, fatality_plot, ncol=2)
The top 10 events that caused the greatest economic consequences can be seen below. the greatest economic consequences is being defined as the total sum of crop damage and property damage. Unlike injuries and fatalities, these are both in similar terms of dollars and could be simply combined.
top_economic
## # A tibble: 10 x 2
## EVTYPE Damage
## <chr> <dbl>
## 1 FLOOD 148345140700
## 2 HURRICANE/TYPHOON 44220000800
## 3 TORNADO 16520218150
## 4 HAIL 11652348140
## 5 FLASH FLOOD 9223297180
## 6 WINTER STORM 7175061800
## 7 THUNDERSTORM WIND 4987249500
## 8 STORM TIDE 4644738000
## 9 HIGH WIND 3970846450
## 10 WILDFIRE 3838549570
ggplot(top_economic, aes(x=reorder(EVTYPE, -Damage), y=Damage))+
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x="Event Type", y="Economic Damage ($)", title="Most Harmful Weather Events Related to Crop and Property Damage")