This analysis looks at the human and economic damage caused by severe weather events.
It uses the NOAA Storm Database.
This is completed for Reproducible Research: Peer Assessment 2.
This report will answer two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
I load and explore the data first.
Because the integrity of the event-type variable is so poor, I create groups based on specific keywords and complete the analysis with these groups.
library(plyr)
library(rCharts)
library(reshape2)
if(!file.exists("repdata-data-StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"repdata-data-StormData.csv.bz2",
method = "curl")
}
df <- read.csv("repdata-data-StormData.csv.bz2")
The event types are incredibly variable.
For example, there are 985 distinct event types.
The word 'wind' occurs in 220 distinct lower-cased event types.
I will just show the 20 most frequent and the 20 least frequent.
## Total event types
length(unique(df$EVTYPE))
[1] 985
## Find occurrences of 'wind'
tmp <- grep('wind', tolower(df$EVTYPE), value = TRUE)
length(table(tmp))
[1] 220
head(arrange(as.data.frame(table(tmp)),desc(Freq)),20)
tmp Freq
1 tstm wind 219942
2 thunderstorm wind 82564
3 thunderstorm winds 20843
4 high wind 20214
5 marine tstm wind 6175
6 marine thunderstorm wind 5812
7 strong wind 3569
8 high winds 1533
9 tstm wind/hail 1028
10 extreme cold/wind chill 1002
11 cold/wind chill 539
12 wind 346
13 extreme windchill 204
14 strong winds 204
15 marine high wind 135
16 gusty winds 65
17 thunderstorm winds hail 61
18 thunderstorm windss 51
19 marine strong wind 48
20 tstm wind (g45) 39
tail(arrange(as.data.frame(table(tmp)),desc(Freq)),20)
tmp Freq
201 tornadoes, tstm wind, hail 1
202 tstm wind (g45) 1
203 tstm wind (41) 1
204 tstm wind (g35) 1
205 tstm wind 40 1
206 tstm wind 45 1
207 tstm wind 50 1
208 tstm wind 65) 1
209 tstm wind and lightning 1
210 tstm wind damage 1
211 tstm wind g45 1
212 tstm wind g58 1
213 tunderstorm wind 1
214 wind and wave 1
215 wind chill/high wind 1
216 wind storm 1
217 wind/hail 1
218 winter storm high winds 1
219 winter storm/high wind 1
220 winter storm/high winds 1
This is poor data integrity.
It is a huge problem with the data set.
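To illustrate how much of this variation is purely cosmetic, here is a minimal sketch that normalises case and whitespace before counting distinct labels. The toy vector and the helper name `normalise_evtype` are hypothetical and not part of the original analysis; the labels are only modelled on the listings above.

```r
# Toy vector of raw labels (made-up values modelled on the listings above)
evtype <- c("TSTM WIND", " TSTM WIND", "tstm wind", "Tstm  Wind",
            "THUNDERSTORM WINDS", "THUNDERSTORM WIND")

# Hypothetical helper: lower-case, trim the ends, collapse runs of whitespace
normalise_evtype <- function(x) {
  x <- tolower(x)
  x <- trimws(x)
  gsub("\\s+", " ", x)
}

clean <- normalise_evtype(evtype)

length(unique(evtype))  # number of distinct raw labels
length(unique(clean))   # number of distinct labels after normalisation
```

Normalisation of this kind removes case and spacing variants, but it does not merge misspellings or plural forms such as 'thunderstorm wind' and 'thunderstorm winds', which is why keyword groups are used in the rest of the analysis.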
For this analysis, I will scan the top 100 events and put them into logical groups.
I am showing the top 20 here.
## top 20 event occurrences
tmp <- arrange(as.data.frame(table(df$EVTYPE)),desc(Freq))
head(tmp,20)
Var1 Freq
1 HAIL 288661
2 TSTM WIND 219940
3 THUNDERSTORM WIND 82563
4 TORNADO 60652
5 FLASH FLOOD 54277
6 FLOOD 25326
7 THUNDERSTORM WINDS 20843
8 HIGH WIND 20212
9 LIGHTNING 15754
10 HEAVY SNOW 15708
11 HEAVY RAIN 11723
12 WINTER STORM 11433
13 WINTER WEATHER 7026
14 FUNNEL CLOUD 6839
15 MARINE TSTM WIND 6175
16 MARINE THUNDERSTORM WIND 5812
17 WATERSPOUT 3796
18 STRONG WIND 3566
19 URBAN/SML STREAM FLD 3392
20 WILDFIRE 2761
Now I create a new variable and set the event type.
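I go from general groups to more specific groups, so when an event matches several patterns the later, more specific assignment overwrites the earlier one.
Each event therefore ends up with exactly one type.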
df$event_type <- NA
df$event_type[grep('heat|warm', tolower(df$EVTYPE))] <- 'heat'
df$event_type[grep('cold', tolower(df$EVTYPE))] <- 'cold'
df$event_type[grep('wind', tolower(df$EVTYPE))] <- 'wind'
df$event_type[grep('surf|current|tide', tolower(df$EVTYPE))] <- 'ocean'
df$event_type[grep('snow|winter|wintry|sleet|blizzard|ice|freeze|avalanche', tolower(df$EVTYPE))] <- 'snow'
df$event_type[grep('rain', tolower(df$EVTYPE))] <- 'rain'
df$event_type[grep('hail', tolower(df$EVTYPE))] <- 'hail'
df$event_type[grep('flood|fld', tolower(df$EVTYPE))] <- 'flood'
df$event_type[grep('tornado|funnel|waterspout|devil', tolower(df$EVTYPE))] <- 'tornado'
df$event_type[grep('hurricane|depression', tolower(df$EVTYPE))] <- 'hurricane'
df$event_type[grep('lightning', tolower(df$EVTYPE))] <- 'lightning'
df$event_type[grep('fog', tolower(df$EVTYPE))] <- 'fog'
df$event_type[grep('fire', tolower(df$EVTYPE))] <- 'fire'
df$event_type[grep('drought', tolower(df$EVTYPE))] <- 'drought'
df$event_type[grep('landslide', tolower(df$EVTYPE))] <- 'landslide'
df$event_type[is.na(df$event_type)] <- 'other'
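The order of the assignments above matters, because a later match overwrites an earlier one. The following sketch reproduces that behaviour on a handful of labels. The helper `assign_groups` and the shortened `rules` vector are hypothetical and contain only a subset of the patterns used above.

```r
# Hypothetical helper: apply (label = pattern) rules in order;
# a later rule overwrites the label set by an earlier one.
assign_groups <- function(evtype, rules) {
  out <- rep(NA_character_, length(evtype))
  lower <- tolower(evtype)
  for (label in names(rules)) {
    out[grepl(rules[[label]], lower)] <- label
  }
  out[is.na(out)] <- "other"  # anything unmatched falls into 'other'
  out
}

# A subset of the patterns above, in the same relative order
rules <- c(cold = "cold", wind = "wind", snow = "snow|winter",
           hail = "hail", flood = "flood|fld")

groups <- assign_groups(
  c("EXTREME COLD/WIND CHILL", "TSTM WIND/HAIL",
    "WINTER STORM HIGH WINDS", "DUST STORM"),
  rules
)
groups
# [1] "wind"  "hail"  "snow"  "other"
```

With this ordering, a label such as 'TSTM WIND/HAIL' is counted as hail rather than wind. Reordering the rules would move such mixed events between groups, so the group counts depend on the order chosen.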
Let's look at the number of events in each group.
df2 <- ddply(df, .(event_type), summarise,
count = length(EVTYPE)
)
df2 <- arrange(df2, desc(count))
df2
event_type count
1 wind 363686
2 hail 290398
3 flood 86127
4 tornado 71686
5 snow 44080
6 lightning 15775
7 rain 12210
8 fire 4240
9 other 3233
10 heat 2958
11 drought 2512
12 ocean 2269
13 fog 1883
14 cold 892
15 hurricane 348
Wind and hail have many more occurrences than the other groups: 363,686 and 290,398 respectively.
Flood, tornado, snow, lightning, and rain each have between 10,000 and 90,000 occurrences.
The others have fewer than 5,000 occurrences each.
Now we can break down human and economic damage by major groups.
Let's look at the total human damage.
df2 <- ddply(df, .(event_type), summarise,
fatalities = sum(FATALITIES),
injuries = sum(INJURIES)
)
df2 <- arrange(df2, desc(injuries))
df3 <- melt(df2, id.vars = c('event_type'))
p1 <- nPlot(value ~ event_type, group = 'variable', data = df3, type = 'multiBarHorizontalChart')
p1$chart(stacked = TRUE)
p1$show('inline', include_assets = TRUE, cdn = TRUE)
Figure 1: Total human fatalities and injuries by event type
Tornadoes have caused the most injuries and fatalities.
Wind, heat, flood, snow, and lightning are next.
Let's look at the human damage per event.
df2 <- ddply(df, .(event_type), summarise,
fatalities = mean(FATALITIES),
injuries = mean(INJURIES)
)
df2 <- arrange(df2, desc(injuries))
df3 <- melt(df2, id.vars = c('event_type'))
p1 <- nPlot(value ~ event_type, group = 'variable', data = df3, type = 'multiBarHorizontalChart')
p1$chart(stacked = TRUE)
p1$show('inline', include_assets = TRUE, cdn = TRUE)
Figure 2: Human fatalities and injuries per event by event type
In terms of individual events, hurricanes and heat events are far more dangerous.
Tornadoes, ocean events, and cold events are next.
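The difference between the two rankings comes from how often each event type occurs. A small sketch with made-up numbers (not taken from the Storm Database) shows how a frequent event type can lead on totals while a rare one leads on the per-event mean:

```r
# Made-up toy data: four tornado records and one hurricane record
toy <- data.frame(
  event_type = c("tornado", "tornado", "tornado", "tornado", "hurricane"),
  INJURIES   = c(10, 0, 5, 5, 12)
)

# Total and mean injuries per group, using base R instead of plyr
total     <- tapply(toy$INJURIES, toy$event_type, sum)
per_event <- tapply(toy$INJURIES, toy$event_type, mean)

names(which.max(total))      # group with the largest total
names(which.max(per_event))  # group with the largest mean per event
```

In this toy data the tornado group has the larger total (20 against 12), while the single hurricane record has the larger mean (12 against 5).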
Let's look at the economic damage in total and per event.
Use the chart legend to choose which series to view.
df2 <- ddply(df, .(event_type), summarise,
property_damage_total = sum(PROPDMG),
property_damage_mean = mean(PROPDMG)
)
df2 <- arrange(df2, desc(property_damage_total))
df3 <- melt(df2, id.vars = c('event_type'))
p1 <- nPlot(value ~ event_type, group = 'variable', data = df3, type = 'multiBarHorizontalChart')
p1$chart(stacked = FALSE)
p1$show('inline', include_assets = TRUE, cdn = TRUE)
Figure 3: Total and mean property damage by event type
Tornadoes, wind, and flood have done the most damage. Per event, hurricanes, tornadoes, and lightning are the most damaging.
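One caveat on these figures: in the Storm Database, PROPDMG is reported together with a magnitude code in the PROPDMGEXP column (K for thousands, M for millions, B for billions), and the sums above use PROPDMG on its own. The sketch below shows one way the two columns could be combined into dollar amounts. The toy values are made up, and treating blank or unrecognised codes as a multiplier of 1 is an assumption, not something the original analysis specifies.

```r
# Made-up toy rows; column names match the Storm Database
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 500),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Documented magnitude codes
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier; unmatched codes come back as NA
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1  # assumption: blank or unrecognised codes mean "as is"

toy$property_damage_usd <- toy$PROPDMG * mult
toy$property_damage_usd
```

Applying a conversion like this before summing could change the ranking of event types, since a small number of records coded in millions or billions would then carry far more weight.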