The National Weather Service storm dataset recorded the types and impact of severe weather events from 1950 - 2011. This analysis explores the storm dataset and answer 2 questions about these weather events:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
setwd("/Users/yingjiang/Dropbox/Education/Coursera/data_science_spec/data_science_c5/Projects/Project2")
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "stormdata.csv",
method = "curl")
stormdata <- read.csv("stormdata.csv")
From the following document:
NATIONAL WEATHER SERVICE INSTRUCTION 10-1605; AUGUST 17, 2007; Operations and Services; Performance, NWSPD 10-16; STORM DATA PREPARATION; Pg 6: Table 1 (Table of allowed event types)
… copied the allowed event types from Table, into Microsoft Word (for Mac 2011, Version 14.1.0(130310)).
Copied the text into Microsoft Excel (for Mac 2011, Version 14.1.0(130310)).
Saved as csv into current folder.
Cleaned the table:
removed white space and extraneous letter at the end
changed to upper case
eventtype <- read.csv("Eventtype.csv", header = F, stringsAsFactors = F)
eventtype <- substr(eventtype$V1, 1, nchar(eventtype$V1) - 2)
eventtype <- toupper(eventtype)
eventtype
## [1] "ASTRONOMICAL LOW TIDE" "AVALANCHE"
## [3] "BLIZZARD" "COASTAL FLOOD"
## [5] "COLD/WIND CHILL" "DEBRIS FLOW"
## [7] "DENSE FOG" "DENSE SMOKE"
## [9] "DROUGHT" "DUST DEVIL"
## [11] "DUST STORM" "EXCESSIVE HEAT"
## [13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
## [15] "FLOOD" "FROST/FREEZE"
## [17] "FUNNEL CLOUD" "FREEZING FOG"
## [19] "HAIL" "HEAT"
## [21] "HEAVY RAIN" "HEAVY SNOW"
## [23] "HIGH SURF" "HIGH WIND"
## [25] "HURRICANE (TYPHOON)" "ICE STORM"
## [27] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
## [29] "LIGHTNING" "MARINE HAIL"
## [31] "MARINE HIGH WIND" "MARINE STRONG WIND"
## [33] "MARINE THUNDERSTORM WIND" "RIP CURRENT"
## [35] "SEICHE" "SLEET"
## [37] "STORM SURGE/TIDE" "STRONG WIND"
## [39] "THUNDERSTORM WIND" "TORNADO"
## [41] "TROPICAL DEPRESSION" "TROPICAL STORM"
## [43] "TSUNAMI" "VOLCANIC ASH"
## [45] "WATERSPOUT" "WILDFIRE"
## [47] "WINTER STORM" "WINTER WEATHER"
There are 48 allowed events.
Attached an empty column to stormdata to later store the tidy events.
stormdata1 <- cbind(stormdata, character(length(stormdata$EVTYPE)))
colnames(stormdata1)[ncol(stormdata1)] <- "EVTYPE_TIDY"
stormdata1$EVTYPE <- factor(toupper(stormdata1$EVTYPE))
stormdata1$EVTYPE_TIDY <- character(length(stormdata1$EVTYPE_TIDY))
Populated rows that correspond exactly to the allowed event types.
for(i in 1:length(eventtype)) {
stormdata1$EVTYPE_TIDY[stormdata1$EVTYPE == eventtype[i]] <- eventtype[i]
}
sum(stormdata1$EVTYPE_TIDY == "")
## [1] 266946
There are still 266946, out of 902297, cases of unclassified EVTYPE, which didn’t correspond to the 48 allowed events.
NOTE: Certain observations contained keywords that corresponded to multiple allowed-events (e.g. “blizzard” and “heavy snow” are classified differently in the allowed-events list, but could appear in the description for the same observation.) This effect was ignored, and the alphabetically latter occurence (in this case, “heavy snow”) was taken to be the tidy event name.
The unclassified EVTYPE were extracted into a separate vector for clarity.
b <- stormdata1$EVTYPE[stormdata1$EVTYPE_TIDY == ""]
Tidied up these event types further, as much as possible, by inspection. The goal is to:
Remove duplicates.
Correct spelling.
Group similar meanings together.
tofind <- c("(.*)WIND(.*)|(.*)GUST(.*)", # into "STRONG WIND"
"(.*)DRY(.*)|DRIEST|LOW RAIN|BELOW NORMAL PRECI(.*)", #into "DROUGHT"
"HOT|HIGH TEMPERATURE", # INTO "HEAT"
"(.*)WARMTH(.*)|(.*)WARM(.*)", # into "HEAT"
"LAKE(.*)EFFECT", # into "LAKE-EFFECT SNOW"
"(.*)SNOW(.*)", # into "HEAVY SNOW"
"(.*)WET(.*)",
"(.*)RAIN(.*)",
"(.*)PRECIPITATION(.*)",
"(.*)PRECIPATATION(.*)",
"(.*)SHOWER(.*)", # into "HEAVY RAIN"
"(.*)FREEZE(.*)|(.*)FREEZING(.*)|LOW TEMP(.*)|(.*)ICE(.*)|(.*)THERMIA(.*)|(.*)FROST(.*)", # into "FROST/FREEZE"
"(.*)DUST(.*)", # into "DUST STORM"
"AVALAN(.*)", # into "AVALANCHE"
"(.*)WIND(.*)CHILL(.*)|(.*)WIND(.*)COLD(.*)", # Into "COLD/WIND CHILL"
"(.*)CHILL(.*)WIND(.*)|(.*)COLD(.*)WIND(.*)", # Into "COLD/WIND CHILL"
"(.*)FIRE(.*)", # Into "WILDFIRE"
"COASTAL(.*)", # INTO "COASTAL FLOOD"
"(.*)FLOOD(.*)|(.*)URBAN(.*)", # into "FLOOD"
"(.*)FOG(.*)", # into "FREEZING FOG"
"(.*)FUNNEL(.*)", # into "FUNNEL CLOUD"
"(.*)SURF(.*)", # into "HIGH SURF"
"HIGH WAVES|HIGH TIDES|HIGH SEAS|HIGH SWELLS|ROGUE WAVES|ROUGH SEAS", # into "HIGH SURF"
"HURRICANE|TYPHOON", # into "HURRICANE (TYPHOON)"
"LIGHTING|LIGNTNING|LIGHTNING(.*)", # into "LIGHTNING"
"SMOKE", # into "DENSE SMOKE"
"STORM SURGE", # into "STORM SURGE/TIDE"
"TORNDAO|TORNADOS|TORNADOES|TORNADO(.*)", # into "TORNADO"
"(.*)WATER(.*)SPOUT(.*)|WAYTERSPOUT", # into "WATERSPOUT
"TSTM(.*)|THUDERSTORM|TUNDERSTORM|(.*)THUNDERSTORM(.*)", # into "THUNDERSTORM WIND". Comprises of "THUNDERSTORMS", "THUNDERSTORMW", "THUNDERSTROM"
"TROPICAL STORM(.*)", # into "TROPICAL STORM"
"VOLCANIC(.*)", # into "VOLCANIC ASH"
"(.*)WINTER(.*)|(.*)WINTRY(.*)|(.*)WINTERY(.*)" # into "WINTER WEATHER"
)
toreplace <- c("STRONG WIND",
"DROUGHT",
"HEAT", "HEAT",
"LAKE-EFFECT SNOW",
"HEAVY SNOW",
"HEAVY RAIN", "HEAVY RAIN", "HEAVY RAIN", "HEAVY RAIN", "HEAVY RAIN",
"FROST/FREEZE",
"DUST STORM",
"AVALANCHE",
"COLD/WIND CHILL", "COLD/WIND CHILL",
"WILDFIRE",
"COASTAL FLOOD", "FLOOD",
"FREEZING FOG",
"FUNNEL CLOUD",
"HIGH SURF", "HIGH SURF",
"HURRICANE (TYPHOON)",
"LIGHTNING",
"DENSE SMOKE",
"STORM SURGE/TIDE",
"TORNADO",
"WATERSPOUT",
"THUNDERSTORM WIND",
"TROPICAL STORM",
"VOLCANIC ASH",
"WINTER WEATHER"
)
for(i in 1:length(tofind)) {
b <- gsub(tofind[i], toreplace[i], b)
}
Created another empty vector that correspond to the unclassified EVTYPE (all empty entries from EVTYPE_TIDY), and filled it with the corresponding allowed event keywords.
EVTYPE_TIDY_unfilled <- stormdata1$EVTYPE_TIDY[stormdata1$EVTYPE_TIDY == ""]
for(i in 1:length(eventtype)) {
EVTYPE_TIDY_unfilled[b == eventtype[i]] <- eventtype[i]
}
stormdata1$EVTYPE_TIDY[which(stormdata1$EVTYPE_TIDY == "")] <- EVTYPE_TIDY_unfilled
levels(factor(stormdata1$EVTYPE_TIDY))
## [1] "" "ASTRONOMICAL LOW TIDE"
## [3] "AVALANCHE" "BLIZZARD"
## [5] "COASTAL FLOOD" "COLD/WIND CHILL"
## [7] "DENSE FOG" "DENSE SMOKE"
## [9] "DROUGHT" "DUST DEVIL"
## [11] "DUST STORM" "EXCESSIVE HEAT"
## [13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
## [15] "FLOOD" "FREEZING FOG"
## [17] "FROST/FREEZE" "FUNNEL CLOUD"
## [19] "HAIL" "HEAT"
## [21] "HEAVY RAIN" "HEAVY SNOW"
## [23] "HIGH SURF" "HIGH WIND"
## [25] "HURRICANE (TYPHOON)" "ICE STORM"
## [27] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
## [29] "LIGHTNING" "MARINE HAIL"
## [31] "MARINE HIGH WIND" "MARINE STRONG WIND"
## [33] "MARINE THUNDERSTORM WIND" "RIP CURRENT"
## [35] "SEICHE" "SLEET"
## [37] "STORM SURGE/TIDE" "STRONG WIND"
## [39] "THUNDERSTORM WIND" "TORNADO"
## [41] "TROPICAL DEPRESSION" "TROPICAL STORM"
## [43] "TSUNAMI" "VOLCANIC ASH"
## [45] "WATERSPOUT" "WILDFIRE"
## [47] "WINTER STORM" "WINTER WEATHER"
After this operation, there are still 2848 unclassified observations. Forget about these, as there are entries that did not correspond to identifiable disaster events (Reports, summaries, regular phenomena).
Subsetted data to contain only allowed events (removed empty EVTYPE_TIDY observations).
stormdata2 <- stormdata1[stormdata1$EVTYPE_TIDY != "", ]
stormdata2$EVTYPE_TIDY <- factor(stormdata2$EVTYPE_TIDY)
Health damage: Taking population health damage to be fatalities and injuries, calculated fatalities, injuries and their sums for each type of event.
fatalsum <- as.numeric(by(stormdata2$FATALITIES, stormdata2$EVTYPE_TIDY, sum))
injsum <- as.numeric(by(stormdata2$INJURIES, stormdata2$EVTYPE_TIDY, sum))
health_dmg <- fatalsum + injsum
Property damage: Absolute property damage dollar values were given by PROPDMG * PROPDMGEXP, where PROPDMGEXP give exponents of 10^3, 10^6 and 10^9 for “K”, “M” and “B” respectively. Crop damage was not included, for CROPDMGEXP values are missing.
property_dmg <- numeric(length(stormdata2$PROPDMG))
for(i in 1:length(stormdata2$PROPDMG)) {
if(stormdata2$PROPDMGEXP[i] == "") {
property_dmg[i] <- stormdata2$PROPDMG[i] * 1
}
if(stormdata2$PROPDMGEXP[i] == "K") {
property_dmg[i] <- stormdata2$PROPDMG[i] * 1000
}
if(stormdata2$PROPDMGEXP[i] == "") {
property_dmg[i] <- stormdata2$PROPDMG[i] * 1000000
}
if(stormdata2$PROPDMGEXP[i] == "") {
property_dmg[i] <- stormdata2$PROPDMG[i] * 1000000000
}
}
stormdata2 <- cbind(stormdata2, length(property_dmg))
colnames(stormdata2)[ncol(stormdata2)] <- "PROPDMG_FULL"
stormdata2$PROPDMG_FULL <- property_dmg
propsum <- as.numeric(by(stormdata2$PROPDMG_FULL, stormdata2$EVTYPE_TIDY, sum))
Combined health and property damage into a single dataframe.
est_dmg <- as.data.frame(cbind(levels(stormdata2$EVTYPE_TIDY),
fatalsum, injsum, health_dmg,
propsum))
colnames(est_dmg) <- c("Storm_type", "Fatalities", "Injuries", "Total_health_damage", "Total_property_dmg")
est_dmg$Storm_type <- factor(tolower(est_dmg$Storm_type)) # Note: can't plot "character" as x-axis
est_dmg$Fatalities <- as.numeric(as.character(est_dmg$Fatalities))
est_dmg$Injuries <- as.numeric(as.character(est_dmg$Injuries))
est_dmg$Total_health_damage <- as.numeric(as.character(est_dmg$Total_health_damage))
est_dmg$Total_property_dmg <- as.numeric(as.character(est_dmg$Total_property_dmg))
head(est_dmg)
## Storm_type Fatalities Injuries Total_health_damage
## 1 astronomical low tide 0 0 0
## 2 avalanche 225 170 395
## 3 blizzard 101 805 906
## 4 coastal flood 3 2 5
## 5 cold/wind chill 95 12 107
## 6 dense fog 18 342 360
## Total_property_dmg
## 1 320000
## 2 1621800
## 3 24683950
## 4 13290560
## 5 1990000
## 6 8224000
Plotted Fatalities, Injuries and their sums on the same axes.
par(las = 2, mar = c(12, 5, 4, 2)) # "las = 2" rotates x-axis 90o.
with(est_dmg, plot.default(Storm_type, Fatalities,
type = "p",
pch = 15,
xlab = "",
ylab = "Human health damages",
main = "Storm damages to human health")) # Make first plot
with(est_dmg, points(Storm_type, Injuries,
type = "p",
pch = 16,
col = "red")) # Add plot
with(est_dmg, points(Storm_type, Total_health_damage,
type = "p",
pch = 17,
col = "blue")) # Add plot
legend("topleft",
col = c("black", "red", "blue"),
lty = 1,
pch = c(15, 16, 17),
cex = 0.8,
legend = c("Fatalities", "Injuries", "Total")) # Annotate
axis(side = 1, at = 1:47, labels = as.character(est_dmg$Storm_type)) # Label tickmarks on x-axis
Higher fatalities roughly correspond to higher injuries.
Adding fatalities and injuries number is a good estimator for considering total health damage.
Make several categories for damages to human health:
est_dmg$Storm_type[est_dmg$Total_health_damage <= 1000]
## [1] astronomical low tide avalanche
## [3] blizzard coastal flood
## [5] cold/wind chill dense fog
## [7] dense smoke drought
## [9] dust devil dust storm
## [11] extreme cold/wind chill freezing fog
## [13] frost/freeze funnel cloud
## [15] heavy rain high surf
## [17] hurricane (typhoon) lake-effect snow
## [19] lakeshore flood marine hail
## [21] marine high wind marine strong wind
## [23] marine thunderstorm wind rip current
## [25] seiche sleet
## [27] storm surge/tide tropical depression
## [29] tropical storm tsunami
## [31] volcanic ash waterspout
## [33] winter weather
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
est_dmg$Storm_type[est_dmg$Total_health_damage > 1000 &
est_dmg$Total_health_damage < 5000]
## [1] flash flood hail heat heavy snow
## [5] high wind ice storm thunderstorm wind wildfire
## [9] winter storm
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
est_dmg$Storm_type[est_dmg$Total_health_damage >= 5000]
## [1] excessive heat flood lightning strong wind
## [5] tornado
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
The type of storm that causes the most amount of fatalities and injuries is:
est_dmg$Storm_type[est_dmg$Total_health_damage == max(est_dmg$Total_health_damage)]
## [1] tornado
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
Note that plotting property damage on a linear scale does not create an effective visual representation of the distribution of property damage for different event types. This is because the difference between property damage caused by different event types are of several orders of magnitude. Therefore used a logarithmic scale on the y-axis.
par(mar = c(2, 6, 4, 2))
with(est_dmg, plot.default(xy.coords(Storm_type, Total_property_dmg/1000000),
xlim = c(1,60),
ylim = c(.001, max(Total_property_dmg/1000000)),
log = "y",
pch = 15,
xlab = "", xaxt = "n", # choose not to label the tickmarks here.
ylab = "",
main = "Storm damages to properties"))
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted
## from logarithmic plot
text(est_dmg$Storm_type, est_dmg$Total_property_dmg/1000000, est_dmg$Storm_type,
cex=0.8, pos = 4, col="blue") # Labelled each point in the plot to effectively visualize which event caused more damages.
title(ylab = "Property damage (million $)",
mgp = c(4, 1, 0)) # Labelled y-axis
Again several categories could be created for damages to property:
est_dmg$Storm_type[log10(est_dmg$Total_property_dmg) <= 5]
## [1] dense smoke lakeshore flood marine hail rip current
## [5] sleet
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
est_dmg$Storm_type[log10(est_dmg$Total_property_dmg) > 5 &
log10(est_dmg$Total_property_dmg) < 9]
## [1] astronomical low tide avalanche
## [3] blizzard coastal flood
## [5] cold/wind chill dense fog
## [7] drought dust devil
## [9] dust storm excessive heat
## [11] extreme cold/wind chill freezing fog
## [13] frost/freeze funnel cloud
## [15] heat heavy rain
## [17] high surf high wind
## [19] hurricane (typhoon) ice storm
## [21] lake-effect snow marine high wind
## [23] marine strong wind marine thunderstorm wind
## [25] seiche storm surge/tide
## [27] tropical depression tropical storm
## [29] tsunami volcanic ash
## [31] waterspout wildfire
## [33] winter storm winter weather
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
est_dmg$Storm_type[log10(est_dmg$Total_property_dmg) >= 9]
## [1] flash flood flood hail heavy snow
## [5] lightning strong wind thunderstorm wind tornado
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
The type of storm that causes the most amount of property damage is:
est_dmg$Storm_type[log10(est_dmg$Total_property_dmg) == max(log10(est_dmg$Total_property_dmg))]
## [1] strong wind
## 47 Levels: astronomical low tide avalanche blizzard ... winter weather
To answer the questions,
Across the United States, tornadoes and winds, lightning, floods and excessive heat cause the most harm to population health, in terms of generating the highest number of fatalities and injuries.
Across the United States, tornadoes and winds, lightning, floods, snow and hail have the greatest economic consequences, in terms of generating the great amount of property damage in dollars.