Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. Using insights of the impact by severe weather events on population health and economic consequence, government or municipal managers who are responsible for preparing for severe weather events are able to prioritize resources for different types of events.
This report provides some insights of the impact across the United States by identifying (1) the types of events which are most harmful with respect to population health, and (2) the types of events which have the greatest economic consequences. This report presents the findings based on the data in the U.s. National Oceanic and Atmospheric Administration’s (NOAA) storm database for events between the year 1950 and Nov 2011.
During the period, Tornados posted the greatest impact to human health (91,346 injuries and 5,633 fatalities), followed by Excessive Heat (1,903 fatalities and 6525 injuries), TSTM Wind (504 fatalities and 6,957 injuries).
On the economic impact, Flood accounted for the highest damage of over US$150 billions, followed by Hurricane/Typhoon (over US$71 billions) and Tornado (over US$57 billions).
In conclusion, the analysis showed that Tornadoes and Flood were the two most costly threats to the human health (measured by fatalities and injuries) and economy health (measured by damages to property and crops). Threats by Excessive Heat and Hurricane/Typhoon shall not be overlooked as they also posted significant undesirable impact to both the human health and economy health.
About NOAA storm database: It tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
# Create a data directory (if not created yet).
if (!file.exists("./data")) {
dir.create("./data")
}
# Download the data file (if not downloaded yet).
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("./data/stormdata.csv.bz2")) {
download.file(fileUrl, destfile="./data/stormdata.csv.bz2", cacheOK = TRUE)
}
stormdata <- read.csv(bzfile("./data/StormData.csv.bz2"))
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The database contained 902,297 observations of 37 variables.
To analyze the impact of weather events to population health and economy, we will focus on the following measurements across the U.S.: * Measurements related to population health: FATALITIES, INJURIES * * Measurements related to economy consequences: PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP * * Events: Tidy the EV
data_subset <- subset(stormdata,
(stormdata$INJURIES > 0 | stormdata$FATALITIES > 0 |
stormdata$PROPDMG > 0 | stormdata$CROPDMG > 0),
select = c("STATE","COUNTY","BGN_DATE","END_DATE",
"EVTYPE","FATALITIES","INJURIES","PROPDMG",
"PROPDMGEXP","CROPDMG","CROPDMGEXP"))
str(data_subset)
## 'data.frame': 254633 obs. of 11 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
The final dataset used for the weather events studies contains 254,633 observations of 11 variables.
The variable EVTYPE records the events required for our studies. However, we observed inconsistency in the data due to typos, different abbreviations, letter capitalization standards, etc. For example: BEACH EROSIN, Beach Erosion, BEACH EROSION refer to the same event. This step performs some basic data cleansing on EVTYPE to resolve such inconsistency issue.
A new variable, ENTYPE_CLEAN, is created to store the cleansed event type.
data_subset$EVTYPE_CLEAN <- data_subset$EVTYPE
data_subset$EVTYPE_CLEAN <- sub(".*Avalanche.*", "Avalanche", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Blizzard.*", "Blizzard", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Cold.*", "Cold", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Debris.*", "Debris", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Drought.*", "Drought", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Dust.*", "Dust", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Depression.*", "Depression", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Fog.*", "Fog", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Flood.*", "Flood", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Freeze.*", "Freeze", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Funnel.*", "Funnel", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Hurricane.*", "Hurricane/Typhoon", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Hail.*", "Hail", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Heat.*", "Heat", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Lightning.*", "Lightning", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Rain.*", "Rain", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Rip.*", "Rip", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Snow.*", "Snow", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Smoke.*", "Smoke", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Surf.*", "Surf", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Seiche.*", "Seiche", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Sleet.*", "Sleet", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Thunderstorm.*", "Thunderstorm", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*tstm.*", "Thunderstorm", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Tornado.*", "Tornado", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Typhoon.*", "Hurricane/Typhoon", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Tsunami.*", "Tsunami", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Tide.*", "Tide", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Volcanic.*", "Volcanic", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Waterspout.*", "Waterspout", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Wildfire|fire.*", "Wildfire", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*microburst.*", "Microburst", data_subset$EVTYPE, ignore.case=TRUE)
This step computes the property damage costs (PROPDMG) and crop damage costs (CROPDMG). To compute the cost, we multiply the value in PROPDMG and CROPDMG with the unit (K, M, B) indicated in PROPDMGEXP and CROPDMGEXP respectively, where “K” stands for thousands, “M” for millions, and “B” for billions.
New variables created:
PROPDMG_COST - property damage costs in million dollars
CROPDMG_COST - crop damage costs in million dollars
compute_cost <- function(value, unit){
# Cost is computed as value * unit, and then standardized to million dollars.
if (is.na(unit)) {
factor = 1
} else if (unit == 'k' || unit == 'K') {
factor = 1000
} else if (unit == 'm' || unit == 'M'){
factor = 1000000
} else if (unit == 'b' || unit == 'B'){
factor = 1000000000
} else {
factor = 1
}
# return computed value
return((value * factor)/1000000) # standardize to million dollars
}
data_subset$PROPDMG_COST <- mapply(compute_cost, data_subset$PROPDMG, unit=data_subset$PROPDMGEXP)
data_subset$CROPDMG_COST <- mapply(compute_cost, data_subset$CROPDMG, unit=data_subset$CROPDMGEXP)
Now, the data is ready for us to condct a data analysis to address the following questions:
1.Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2.Across the United States, which types of events have the greatest economic consequences?
For damage related to population health, we summarize to get the total FATALITIES and total INJURIES per event type using the tidy dataset, and then find the top 10 events (EVTYPE_CLEAN) with the highest combined total fatalities and injuries.
New variables created:
fatalities - total fatalities per event type
injuries - total injuries per event type
total_health - total fatalities + injuries per event type
top10_health - top 10 events
# Aggregate fatalities by event type
fatalities <- aggregate(data_subset$FATALITIES, by=list(data_subset$EVTYPE_CLEAN), sum)
names(fatalities) <- c("event","fatalities")
# Aggregate injuries by event type
injuries <- aggregate(data_subset$INJURIES, by=list(data_subset$EVTYPE_CLEAN), sum)
names(injuries) <- c("event","injuries")
# Aggregate fatalities+injuries by event type
total_health <- aggregate(data_subset$FATALITIES+data_subset$INJURIES, by=list(data_subset$EVTYPE), sum)
names(total_health) <- c("event","total_health")
# Merge results
health_merged <- merge(fatalities, injuries)
health_merged <- merge(health_merged, total_health)
# Get top 10
top10_health <- head(health_merged[order(health_merged$total_health, decreasing = TRUE),],10)
For damage related to economy, we summarize to get the total property damage (PROPDMG_COST) and total crop damage (CROPDMG_COST) using the tidy dataset, and then find the top 10 events (EVTYPE_CLEAN) with the highest combined total damage costs.
New variables created:
property_damage - total property damage per event type
crop_damage - total crop damage per event type
total_damage - total property damage + crop damage per event type
top10_damage - top 10 events
# Aggregate property damage by event type
property_damage <- aggregate(data_subset$PROPDMG_COST, by=list(data_subset$EVTYPE_CLEAN), sum)
names(property_damage) <- c("event","property_damage")
# Aggregate crop damage by event type
crop_damage <- aggregate(data_subset$CROPDMG_COST, by=list(data_subset$EVTYPE), sum)
names(crop_damage) <- c("event","crop_damage")
# Aggregate property damage and crop damage by event type
total_damage <- aggregate(data_subset$PROPDMG_COST+data_subset$CROPDMG_COST,
by=list(data_subset$EVTYPE_CLEAN), sum)
names(total_damage) <- c("event","total_damage")
# Merge results
damage_merge <- merge(property_damage, crop_damage)
damage_merge <- merge(damage_merge, total_damage)
# Get top 10 events
top10_damage <- head(damage_merge[order(damage_merge$total_damage, decreasing = TRUE),],10)
This section lists the top 10 storm events with the most harmful impact to population health in terms of fatalities and injuries (observations made between 1950 to 2011).
From List 1, we observed that Tornado accounted for the maximum number of health impact (5,633 fatalities and 91,346 injuries), followed by Excessive Heat (1,903 fatalities and 6525 injuries), TSTM Wind (504 fatalities and 6,957 injuries) and so forth.
# List the top 10 events
top10_health
## event fatalities injuries total_health
## 404 TORNADO 5633 91346 96979
## 60 EXCESSIVE HEAT 1903 6525 8428
## 420 TSTM WIND 504 6957 7461
## 85 FLOOD 470 6789 7259
## 257 LIGHTNING 816 5230 6046
## 150 HEAT 937 2100 3037
## 72 FLASH FLOOD 978 1777 2755
## 237 ICE STORM 89 1975 2064
## 361 THUNDERSTORM WIND 133 1488 1621
## 477 WINTER STORM 206 1321 1527
# Melt the data to plot the breakdown by fatalities, injuries and total.
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.1.2
top10_health_melted <- melt(top10_health, id=c("event"),
measure.var=c("fatalities", "injuries", "total_health"))
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.2
# generate the plot
ggplot(top10_health_melted, aes(event, value)) +
geom_bar(aes(fill=variable), stat="identity", position="dodge") +
xlab("Events") +
ylab("Populations") +
ggtitle("Top 10 Events Most Harmful to Population Health") +
scale_fill_manual(values=c("#F0E442", "#0072B2", "#D55E00"),
labels=c("Fatalities", "Injuries", "Total")) +
theme(axis.text.x = element_text(angle=90))
From the trends displayed in Figure 1, we observed that Tornado posted the greatest threat to population health when comparing the magnitude of injuries against that caused by the other events. Though the magnitude of fatalities caused by Toranto was relatively lower than injuries, it was still the highest amongst all other events.
This section lists the top 10 storm events with the greatest economic consequences measured by the cost of property damage and crop damage for the observations made between 1950 to 2011.
From List 2, we observed that flood accounted for the highest damage of US$150,319 millions (or over US$150 billion), followed by Hurricane/Typhoon (total damage of US$71,913 millions or over US$71 billions) and Tornado (total damage of US$57,352 million or over US$57 billions).
# List the top 10 events
top10_damage
## event property_damage crop_damage total_damage
## 85 FLOOD 144657.710 5661.9685 150319.678
## 223 HURRICANE/TYPHOON 69305.840 2607.8728 71913.713
## 404 TORNADO 56937.161 414.9533 57352.114
## 347 STORM SURGE 43323.536 0.0050 43323.541
## 133 HAIL 15732.267 3025.9545 18758.222
## 72 FLASH FLOOD 16140.812 1421.3171 17562.129
## 49 DROUGHT 1046.106 13972.5660 15018.672
## 214 HURRICANE 11868.319 2741.9100 14610.229
## 307 RIVER FLOOD 5118.945 5029.4590 10148.405
## 237 ICE STORM 3944.928 5022.1135 8967.041
# Melt the data to plot the breakdown by property damage, crop damage, and total.
library(reshape2)
top10_damage_melted <- melt(top10_damage, id=c("event"),
measure.var=c("property_damage", "crop_damage", "total_damage"))
library(ggplot2)
# generate the plot
ggplot(top10_damage_melted, aes(event, value)) +
geom_bar(aes(fill=variable), stat="identity", position="dodge") +
xlab("Events") +
ylab("Damages (US$ in Million)") +
ggtitle("Top 10 Events with greatest economic consequences") +
scale_fill_manual(values=c("#F0E442", "#0072B2", "#D55E00"),
labels=c("Property", "Crop", "Total")) +
theme(axis.text.x = element_text(angle=90))
From the trends displayed in Figure 2, we observed that Flood caused the greatest loss in terms of property damage and crop damage, followed by Hurricane/Typhoon and Tornado. In comparison of damages between property and crop, flooding was the most damaging to property, with over $144 billions in reported damages (vs total damage of over $150 billions).