Storms and other severe weather events can cause both public health (fatalities, injuries) and economic problems (property, crop damage) for a population. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.
Tornado seems to have the greatest health impact both in terms of fatalities and injuries. It is followed by Thunderstorm Wind in terms of injuries, and Excessive Heat in terms of fatalities. The top ten causes of injuries and fatalities are largely the same.
Flood has caused the most property damage while Drought has caused the most crop damage. What a contrast!!
library(dplyr)
library(ggplot2)
library(reshape2)
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("data")) {
dir.create("data")
download.file(fileUrl, destfile = "data/stormdata.csv.bz2")
}
## A total of 902,297 observations
sdata_f <- read.csv("data/stormdata.csv.bz2")
Subset the observations which have health or economic impacts. The original database has 902,297 observations.
################################################################################
##
## We don't need data where there are no fatalities/injuries or economic impact
## Left with 254,633 observations only (647,664 less observations to worry)
##
################################################################################
sdata <- subset(sdata_f,
FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
After filtering out the non-impact observation, we are left with only 254,633 observations!! Significantly less observations to analyze.
The storm data has been transformed using the National Weather Service Instruction 10-1605, August 17, 2007\(^{[1]}\). The events in the database had to be conformed to the permitted storm data events. Some of the instructions to highlight are:
There are a lot of free flow texts not conforming to the permitted storm data events appear in the database including the use of abbreviations and spelling mistakes. The best effort has been made to transform all these transcription errors to the permitted events. Below is the transformation table that has been used for the same.
################################################################################
##
## Create a substitution list containing "from" and "to" columns. The "from"
## contains the regular expressions that will be "grep"ed in the database and
## will be substituted by the permitted storm data even in the respective "to"
## list.
##
################################################################################
subsList <- rbind(c("TSTM|THUN|TUND|THUD|BURST", "THUNDERSTORM WIND"),
c("STORM|GUSTNADO", "THUNDERSTORM WIND"),
c("FIRE", "WILDFIRE"),
c("WINTRY MIX|RAIN/SNOW", "SLEET"),
c("HEAVY MIX|MIXED PREC", "SLEET"),
c("URBAN|HEAVY RAIN", "HEAVY RAIN"),
c("HEAVY SWE|HEAVY SEA", "HEAVY RAIN"),
c("HEAVY SHO|^RAIN(.*)", "HEAVY RAIN"),
c("RECORD(.*)RAIN|UNSE(.*)RAIN", "HEAVY RAIN"),
c("HVY(.*)RAIN|TORR(.*)RAIN", "HEAVY RAIN"),
c("HEAVY PREC|EXCES(.*)RAIN", "HEAVY RAIN"),
c("HURRICANE|TYPHOON", "HURRICANE(TYPHOON)"),
c("HIGH WIND|^WIND(.*)", "HIGH WIND"),
c("TORN|WHIRL", "TORNADO"),
c("FLASH|RIVER|DAM BREAK", "FLASH FLOOD"),
c("MAJ(.*)FLOOD|RUR(.*)FLOOD", "FLASH FLOOD"),
c("STR(.*)FLOOD|LAK(.*)FLOOD", "LAKESHORE FLOOD"),
c("COASTAL", "COASTAL FLOOD"),
c("^FLOOD|FLOODING", "FLOOD"),
c("HAIL", "HAIL"),
c("BLIZZARD", "BLIZZARD"),
c("EXTENDED(.*)COLD", "EXTREME COLD/WIND CHILL"),
c("EXTREME(.*)COLD", "EXTREME COLD/WIND CHILL"),
c("EXTREME(.*)WIND", "EXTREME COLD/WIND CHILL"),
c("RECORD(.*)COLD", "EXTREME COLD/WIND CHILL"),
c("UNSEAS(.*)COLD|HYPOTHERM", "EXTREME COLD/WIND CHILL"),
c("FOG", "DENSE FOG"),
c("STORM SURGE", "STORM SURGE/TIDE"),
c("HEAVY(.*)SNOW|RECORD SNOW", "HEAVY SNOW"),
c("^SNOW(.*)|BLOWING(.*)SNOW", "HEAVY SNOW"),
c("LATE(.*)SNOW|EXCE(.*)SNOW", "HEAVY SNOW"),
c("SLIDE|LAND|EROS", "DEBRIS FLOW"),
c("GUSTY|STRONG WIND", "STRONG WIND"),
c("HYPERTH|UNS(.*)WARM", "EXCESSIVE HEAT"),
c("RECORD HEAT|EXCE(.*)HEAT", "EXCESSIVE HEAT"),
c("EXTREME HEAT", "EXCESSIVE HEAT"),
c("HIGH TIDE", "STORM SURGE/TIDE"),
c("LIGHT SNOW|LAKE(.*)SNOW", "LAKE-EFFECT SNOW"),
c("TROPICAL STORM", "TROPICAL STORM"),
c("HEAT WAVE|WARM", "HEAT"),
c("WINTER STORM", "WINTER STORM"),
c("ICE", "ICE STORM"),
c("SURF|^HIGH(.*)S|^HIGH(.*)W", "HIGH SURF"),
c("SLEET", "SLEET"),
c("FREEZ|ICY ROADS|FROST", "FROST/FREEZE"),
c("WINTER WEATHER|SQUALL", "WINTER WEATHER"),
c("WATERSPOUT", "WATERSPOUT"),
c("^COLD(.*)|LOW(.*)TEMP", "COLD/WIND CHILL"),
c("RIP CURRENT", "RIP CURRENT"),
c("GLAZE", "FREEZING FOG"),
c("GRADIENT WIND", "TROPICAL DEPRESSION"),
c("LIG", "LIGHTNING"),
c("DUST", "DUST STORM"),
c("MARINE|ROUGH SEAS|ROGUE WAV", "MARINE STRONG WIND"),
c("AVALAN", "AVALANCHE")
)
colnames(subsList) <- c("from", "to")
for(i in 1:nrow(subsList)) {
sdata$EVTYPE[grep(subsList[i, "from"],
sdata$EVTYPE, ignore.case = TRUE)] <- subsList[i, "to"]
}
Let’s first examine the leading causes of deaths.
################################################################################
##
## Filter out the fatalities data, group them by events, and get the total
## fatalities for each of these groups.
##
################################################################################
fatalities_t <-
sdata %>%
select(EVTYPE, FATALITIES) %>%
group_by(EVTYPE) %>%
summarise(fatalities = sum(FATALITIES))
The top ten causes of death are:
################################################################################
##
## Arrange the fatalities table in decreasing order of fatalities and display
## the first 10 observations.
##
################################################################################
arrange(fatalities_t, -fatalities)[1:10, ]
## # A tibble: 10 x 2
## EVTYPE fatalities
## <chr> <dbl>
## 1 TORNADO 5637
## 2 EXCESSIVE HEAT 2061
## 3 THUNDERSTORM WIND 1182
## 4 HEAT 1118
## 5 FLASH FLOOD 1040
## 6 LIGHTNING 817
## 7 RIP CURRENT 572
## 8 HIGH SURF 496
## 9 FLOOD 477
## 10 EXTREME COLD/WIND CHILL 316
What are the leading causes of injuries?
################################################################################
##
## Filter out the injuries data, group them by events, and get the total
## injuries for each of these groups.
##
################################################################################
injuries_t <-
sdata %>%
select(EVTYPE, INJURIES) %>%
group_by(EVTYPE) %>%
summarise(injuries = sum(INJURIES))
The top ten causes of injuries are:
################################################################################
##
## Arrange the injuries table in decreasing order of injuries and display the
## first 10 observations.
##
################################################################################
arrange(injuries_t, -injuries)[1:10, ]
## # A tibble: 10 x 2
## EVTYPE injuries
## <chr> <dbl>
## 1 TORNADO 91407
## 2 THUNDERSTORM WIND 13787
## 3 FLOOD 6792
## 4 EXCESSIVE HEAT 6747
## 5 LIGHTNING 5231
## 6 HEAT 2496
## 7 HIGH SURF 1849
## 8 FLASH FLOOD 1803
## 9 WILDFIRE 1608
## 10 HAIL 1371
The number of injuries caused by Tornado is much higher than others. Let’s note this figure as this is kind of an outlier compared to others.
################################################################################
##
## Note the unusually higher number of injuries caused by TORNADO. This is kind
## of an outlier compared to others.
##
################################################################################
tornado_inj <- filter(injuries_t, EVTYPE =="TORNADO")[["injuries"]]
We notice that the top ten causes of injuries and deaths are roughly the same. So we will merge the two tables - fatalities and injuries and plot them together.
################################################################################
##
## Merge the two tables - fatalities and injuries. Melt the top ten observations
## so that both the fatalities and injuries can be plotted on to the same plot.
##
################################################################################
health_impact <-
merge(fatalities_t, injuries_t) %>%
arrange(-fatalities)
health_impact_plot <- melt(health_impact[1:10, ])
## Using EVTYPE as id variables
g <- ggplot(health_impact_plot, aes(reorder(EVTYPE, -value),
value, fill = variable))
The injuries caused by Tornado was an outlier. So plot the graph, zooming in to other observations and just labeling the Tornado injuries. This gives a better graph which would otherwise would have been squished just because of one data point.
################################################################################
##
## Since the injuries caused by tornado is an outlier, we will zoom in to other
## observations by limiting the "ylim" of the plot and just label the outlier.
##
################################################################################
g + geom_bar(stat = "identity", position = "dodge") +
coord_cartesian(ylim = c(0, 15000)) +
geom_text(x = 1.25, y = 15400, label = tornado_inj, col = "brown") +
scale_x_discrete(guide = guide_axis(angle = 60)) +
labs(fill = "Impact Type") +
labs(x = "", y = "Health Impact on the Population") +
labs(title = "Health Impact of Various Natural Calamities on the Population")
################################################################################
##
## Select only the economic impact related columns from the database
##
################################################################################
damages <-
sdata %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
################################################################################
##
## The damages PROPDMG and CROPDMG are supplemented with PROPDMGEXP and
## CROPDMGEXP respectively. These contain the expressions (like K, M, B, etc.)
## to denote by how much should the ?ROPDMG value be multiplied to get the
## effective damage values.
##
## Get all the unique expressions that are possible and sort them
##
################################################################################
dmg_exp <-
sdata$PROPDMGEXP %>%
unique %>%
sort
################################################################################
##
## These unique expressions correspond to these numbers that needs to be
## multiplied.
##
################################################################################
mult <- c(1, 1, 1, 10^0, 10^2, 10^3, 10^4, 10^5, 10^6,
10^7, 10^9, 10^2, 10^2, 10^3, 10^6, 10^6)
The expression and there corresponding multipliers are:
mult_table <- data.frame(dmg_exp, mult)
mult_table
## dmg_exp mult
## 1 1e+00
## 2 - 1e+00
## 3 + 1e+00
## 4 0 1e+00
## 5 2 1e+02
## 6 3 1e+03
## 7 4 1e+04
## 8 5 1e+05
## 9 6 1e+06
## 10 7 1e+07
## 11 B 1e+09
## 12 h 1e+02
## 13 H 1e+02
## 14 K 1e+03
## 15 m 1e+06
## 16 M 1e+06
################################################################################
##
## Substitute the expression with their corresponding multiplier values for both
## property and crop related expressions.
##
################################################################################
damages$PROPDMGEXP <- mult_table$mult[match(damages$PROPDMGEXP,
mult_table$dmg_exp)]
damages$CROPDMGEXP <- mult_table$mult[match(damages$CROPDMGEXP,
mult_table$dmg_exp)]
################################################################################
##
## Calculate the total property and crop damage by multiplying the values with
## corresponding multipliers grouped by event. The damage values are calculated
## in USD billions.
##
################################################################################
economic_impact <-
damages %>%
mutate(PROPDMG = (PROPDMG * PROPDMGEXP)) %>%
mutate(CROPDMG = (CROPDMG * CROPDMGEXP)) %>%
group_by(EVTYPE) %>%
summarise(propdmg = sum(PROPDMG)/(10^9), cropdmg = sum(CROPDMG)/(10^9))
## `summarise()` ungrouping output (override with `.groups` argument)
Get the leading cause of property damage and crop damage.
######################################################
## Get the leading cause of property and crop damage
#####################################################
arrange(economic_impact, -propdmg)[1:10, ]
## # A tibble: 10 x 3
## EVTYPE propdmg cropdmg
## <chr> <dbl> <dbl>
## 1 FLOOD 145. 5.79
## 2 HURRICANE(TYPHOON) 85.4 5.52
## 3 THUNDERSTORM WIND 79.4 NA
## 4 TORNADO 57.0 0.415
## 5 FLASH FLOOD 22.9 NA
## 6 HAIL 15.7 NA
## 7 WILDFIRE 8.50 0.403
## 8 HIGH SURF 6.11 NA
## 9 HEAVY RAIN 3.34 0.816
## 10 DROUGHT 1.05 14.0
arrange(economic_impact, -cropdmg)[1:10, ]
## # A tibble: 10 x 3
## EVTYPE propdmg cropdmg
## <chr> <dbl> <dbl>
## 1 DROUGHT 1.05 14.0
## 2 FLOOD 145. 5.79
## 3 HURRICANE(TYPHOON) 85.4 5.52
## 4 FROST/FREEZE 0.0283 2.00
## 5 EXTREME COLD/WIND CHILL 0.133 1.36
## 6 HEAVY RAIN 3.34 0.816
## 7 EXCESSIVE HEAT 0.00787 0.497
## 8 TORNADO 57.0 0.415
## 9 HEAT 0.0125 0.407
## 10 WILDFIRE 8.50 0.403
We notice that the leading cause of property damage and crop damage are different. So it’s better to examine them separately.
################################################################################
##
## Extract just the property damage figures, arrange them in descending order of
## impact, and plot it.
##
################################################################################
propdmg_t <-
economic_impact %>%
select(EVTYPE, propdmg) %>%
arrange(-propdmg)
g <- ggplot(propdmg_t[1:10, ], aes(reorder(EVTYPE, -propdmg),
propdmg, fill = EVTYPE))
g + geom_bar(stat = "identity") +
scale_x_discrete(guide = guide_axis(angle = 60)) +
theme(legend.position = "none") +
labs(x = "", y = "Property Damage (in USD Billion)") +
labs(title = "Impact of Various Natural Calamities on Property")
The leading cause of property damage is Flood followed by Hurricane(Typhoon) and Thunderstorm Wind.
################################################################################
##
## Extract just the crop damage figures, arrange them in descending order of
## impact, and plot it.
##
################################################################################
cropdmg_t <-
economic_impact %>%
select(EVTYPE, cropdmg) %>%
arrange(-cropdmg)
g <- ggplot(cropdmg_t[1:10, ], aes(reorder(EVTYPE, -cropdmg), cropdmg, fill = EVTYPE))
g + geom_bar(stat = "identity") +
scale_x_discrete(guide = guide_axis(angle = 60)) +
theme(legend.position = "none") +
labs(x = "", y = "Crop Damage (in USD Billion)") +
labs(title = "Impact of Various Natural Calamities on Crop")
The leading cause of property damage is Drought followed by Flood and Hurricane(Typhoon).
Tornado seems to have the greatest health impact both in terms of fatalities and injuries. It is followed by Thunderstorm Wind in terms of injuries, and Excessive Heat in terms of fatalities. The top ten causes of injuries and fatalities are largely the same.
Flood has caused the most property damage while Drought has caused the most crop damage. What a contrast!!