In this report we aim to explore and find which of the severe weather events cause most population health and economic impacts in the United States. Our overall hypothesis is that severe weather event tornados cause the most damage in terms of human life and property. To investigate this hypothesis, we obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It tracks characteristics of major storms and weather events in the United States between the years 1950-2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. There was significant discrepancy in data, in terms of event classification but data became more complete in the recent years. After processing and analyzing data we came to conclusion that tornados cause the most damage, followed by heat and floods.
From the course web site, NOAA storm data is obtained. There is also additional documentation of constructed/defined variables of the database available.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
library(dplyr)
library(stringr)
library(ggplot2)
library(gridExtra)
We first read in the data in the form of a comma-separated-value file compressed via the bzip2 algorithm, obtained from Coursera Website.
sds <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = ",", quote = "\"")
After reading in the data, we do preliminary dataset exploring.
dim(sds)
## [1] 902297 37
colnames(sds)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We are interested in column ‘EVTYPE’ that lists event types that have been entered from 1950 onwards.
str(unique(sds$EVTYPE))
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 856 244 201 629 429 657 972 410 786 ...
Below is a list obtained from Storm Data Event table 2.1.1 (page 6) of NOAA Storm data, which states 48 severe weather types.
offEvType <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
"Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm",
"Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog",
"Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf",
"High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
"Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide",
"Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm",
"Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))
str(offEvType)
## chr [1:48] "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" ...
The storm dataset (sds) contains 985 types of EVTYPEs, whereas the official EVTYPEs are only 48. On further exploring we found out that there are upper/lowercase variations, spelling mistakes, different wordings for the same kind of category. For example: “Thunderstorm Wind” has been entered in many ways, such as : TSTM, THUNDERSTORM, THUNDERSTORM WINDS, TSTM WIND, TSTM WIND (G45),THUNDERSTORM WINDSS etc.
To map event types to official list and to clean up data to an extent, we perform transformations below.
The entries ‘Summary ~’ are taken out, as there are no fatalities/injuries/damages reported.
t <- grep(pattern = "Summary", x = sds$EVTYPE, value = TRUE, ignore.case = TRUE)
nrow(sds[(sds$EVTYPE %in% t), ])
## [1] 76
#Subsetting
sub <- sds[!(sds$EVTYPE %in% t), ]
We add a new column ‘EVTYPEmapped’ to assign category listed in the official list ‘offEvType’
We start with copying same values, then converting all to uppercase and trimming blank spaces.
sub$EVTYPEmapped <- sub$EVTYPE
sub$EVTYPEmapped <- toupper(sub$EVTYPEmapped)
sub$EVTYPEmapped <- str_trim(sub$EVTYPEmapped)
This reduces the categories down by 100 approximately.
str(unique(sub$EVTYPEmapped))
## chr [1:823] "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...
Now we perform string matching and subsititution
sub[grep("^AVAL", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "AVALANCHE"
sub[grep("^TSTM|^THUN\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "THUNDERSTORM WIND"
sub[grep("^HURRICANE|^TYPHOON\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "HURRICANE (TYPHOON)"
sub[grep("^TROPICAL S\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "TROPICAL STORM"
sub[grep("^TORN\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "TORNADO"
sub[grep("^HIGH WIND\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "HIGH WIND"
sub[grep("^COASTAL FLOOD\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "COASTAL FLOOD"
sub[grep("^FLASH FLOOD\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "FLASH FLOOD"
sub[grep("^EXTREME COLD|^EXTREME WIND\\.*)", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "EXTREME COLD/WIND CHILL"
sub[grep("^WINTER ST\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "WINTER STORM"
sub[grep("^WINTER WEATHER\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "WINTER WEATHER"
sub[grep("^HEAT\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "HEAT"
sub[grep("^RIP CURR\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "RIP CURRENT"
sub[grep("^HAIL\\.*", sub$EVTYPEmapped, ignore.case=TRUE), "EVTYPEmapped"] <- "HAIL"
This reduces the categories down by another 200 approximately.
str(unique(sub$EVTYPEmapped))
## chr [1:599] "TORNADO" "THUNDERSTORM WIND" "HAIL" ...
These above transformations were clear, so direct mapping was straightforward. However, there are many more categories where correctly mapping is not that straightforward. For example, FLOOD/FLASH/FLOOD, is it FLASH FLOOD or FLOOD category? BLIZZARD/HEAVY SNOW, HEAVY RAIN/FLOODING and many other events pose such mapping problem. These may need to be addressed case by case researching the remarks entered and detailed understanding of each event type as explained in the Storm data documentation. Due to this limitation, henceforth we focus on filtering data based on other criterias.
In further analysis to answer the research question, we will filter out rows that have no human and property damage. This reduces the original dataset by one fourth.
nrow(sub)
## [1] 902221
sub <- sub[!(sub$FATALITIES == 0 & sub$INJURIES == 0 & sub$PROPDMG == 0.00 & sub$CROPDMG == 0 ), ]
nrow(sub)
## [1] 254633
Now we subset only the columns relevant to our research
cols <-
c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG" , "CROPDMGEXP", "REFNUM", "EVTYPEmapped" )
s <- sub[, cols]
We then extract from column BGN_DATE, year only, and save it in a new column ‘YEAR’
s$YEAR <-as.numeric(format((as.Date(s$BGN_DATE, format = "%m/%d/%Y")), "%Y"))
colnames(s)
## [1] "BGN_DATE" "EVTYPE" "FATALITIES" "INJURIES"
## [5] "PROPDMG" "PROPDMGEXP" "CROPDMG" "CROPDMGEXP"
## [9] "REFNUM" "EVTYPEmapped" "YEAR"
In order to show which of the weather events had the most impact on population, we first compute the total of fatalities and injuries for each event from 1950-2011.
total <- s %>% group_by(EVTYPEmapped) %>% summarise(Fatalities = sum(FATALITIES, na.rm=TRUE), Injuries = sum(INJURIES, na.rm = TRUE))
nrow(total)
## [1] 306
If there no fatalities and injuries, we can eliminate these events from our data frame.
total <- total[which(total$Fatalities > 0 | total$Injuries > 0), ]
nrow(total)
## [1] 151
Now we subset the data to include only top 10 events, with highest fatalities and injuries. First we sort and then subset events for fatalities and injuries separately. Then we merge the two lists.
top10_F_Events <- total[order(-total$Fatalities), ]
top10_F_Events <- top10_F_Events[1:10, c("EVTYPEmapped")]
top10_I_Events <- total[order(-total$Injuries), ]
top10_I_Events <- top10_I_Events[1:10, c("EVTYPEmapped")]
mergeTop <- merge(top10_F_Events, top10_I_Events, all =TRUE)
plotData <- total[(total$EVTYPEmapped %in% mergeTop$EVTYPEmapped), ]
plotData[order(-plotData$Fatalities, -plotData$Injuries), ]
## Source: local data frame [12 x 3]
##
## EVTYPEmapped Fatalities Injuries
## 1 TORNADO 5658 91364
## 2 EXCESSIVE HEAT 1903 6525
## 3 HEAT 1118 2494
## 4 FLASH FLOOD 1018 1785
## 5 LIGHTNING 816 5230
## 6 THUNDERSTORM WIND 712 9509
## 7 RIP CURRENT 577 529
## 8 FLOOD 470 6789
## 9 HIGH WIND 293 1471
## 10 EXTREME COLD/WIND CHILL 287 255
## 11 ICE STORM 89 1975
## 12 HAIL 15 1361
Now we plot scatter plot between the events and totals for fatalities and injuries
ggplot() +
ggtitle("Number of fatalities/injuries from Severe Weather Events from 1950-2011") +
xlab("Total") +
ylab("Event Type" ) +
geom_point(data = plotData, aes(x = Fatalities, y = EVTYPEmapped, color = "Fatalities")) +
geom_point(data = plotData, aes(x = Injuries, y = EVTYPEmapped, color = "Injuries")) +
scale_color_discrete() +
theme(legend.title = element_blank()) +
theme(axis.text.x = element_text(colour = "dodgerblue4", size = 9),
axis.text.y = element_text(colour = "dodgerblue4", size = 9),
axis.title.x = element_text(colour = "coral4", size = 11),
axis.title.y = element_text(colour = "darkgreen", size = 11),
plot.title = element_text(colour = "gray0", size = 12, vjust = 1.5))
From the plot we observe that events Tornados, Excessive Heat, floods cause most fatalities and injuries. Tornados have the most negative impact on population health and has caused over 5500 deaths and over 90000 injuries between 1950-2011.
Performing time series plotting for each top event, shown below, we observe that human loss/injuries from tornados over the years have been significantly higher than other events, supporting above assertion.
plotData2 <- s %>% group_by(YEAR, EVTYPEmapped) %>% summarise(Fatalities = sum(FATALITIES, na.rm=TRUE), Injuries = sum(INJURIES, na.rm = TRUE))
plotData2 <- plotData2[which(plotData2$Fatalities > 0 | plotData2$Injuries > 0), ]
plotData2 <- plotData2[(plotData2$EVTYPEmapped %in% mergeTop$EVTYPEmapped), ]
p <- ggplot() +
ylab("") +
labs(title = "Fatalities/Injuries over period 1950-2011") +
theme(legend.title = element_blank()) +
geom_line(data = plotData2, aes(x = YEAR, y = Fatalities, color = "Fatalities")) +
geom_line(data = plotData2, aes(x = YEAR, y = Injuries, color = "Injuries")) +
theme(axis.text.x = element_text(angle = 90, size = 8, hjust = 0.5, vjust = 0.5)) +
theme(strip.text = element_text(size = 7)) +
theme(plot.title = element_text(colour = "gray0", size = 12, vjust = 1.5)) +
theme(legend.justification = c(0, 1), legend.position = c(0, 1))
p + facet_grid(.~EVTYPEmapped)
We use dataset processed previously, and now explore the columns related to property/crop damage. The columns PROPDMGEXP and CROPDMGEXP contain characters: k for thousand, m for million and b for billion.
unique(s$PROPDMGEXP)
## [1] K M B m + 0 5 6 4 h 2 7 3 H -
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(s$CROPDMGEXP)
## [1] M K m B ? 0 k
## Levels: ? 0 2 B k K m M
Above code reveals that there are incorrect entries, such as ?, 2 etc. First we convert PROPDMGEXP and CROPDMGEXP into uppercase. Then we filter to extract only values such as K, M, B.
s$PROPDMGEXP <- toupper(s$PROPDMGEXP)
s$CROPDMGEXP <- toupper(s$CROPDMGEXP)
sub1 <- s %>% filter(PROPDMGEXP == "K"|PROPDMGEXP == "M"|PROPDMGEXP == "B"| CROPDMGEXP == "K"|CROPDMGEXP == "M"|CROPDMGEXP == "B")
sub1 <- sub1 %>% filter(PROPDMG != 0 | CROPDMG != 0)
nrow(sub1)
## [1] 244715
Here we created a custom function to compute total damage, output in billion
calcfn <- function(p, pExp, c, cExp) {
if(pExp == "K") {
fac1 <- 1000
}else if(pExp == "M") {
fac1 <- 1000000
}else if(pExp == "B") {
fac1 <- 1000000000
}else {
fac1 <- 0
}
val1 <- p * fac1
if(cExp == "K"){
fac2 <- 1000
}else if(cExp == "M") {
fac2 <- 1000000
}else if(cExp == "B") {
fac2 <- 1000000000
}else{
fac2 <- 0
}
val2 <- c * fac2
#to convert into billion
(val1 + val2)/1000000000
}
Using custom function, calcfn, we add computed value to a new column ‘DMG_bn’
sub1$DMG_bn <- calcfn(sub1$PROPDMG, sub1$PROPDMGEXP, sub1$CROPDMG, sub1$CROPDMGEXP)
Now we summarize by events, the total damage. Then we sort in descending order, taking top 10 for our analysis. We then plot the values.
plotData3 <- sub1 %>% group_by(EVTYPEmapped) %>% summarise(TotalDamage = sum(DMG_bn, na.rm=TRUE))
plotData3 <- plotData3[order(-plotData3$TotalDamage), ]
top10 <- plotData3[1:10, ]
ggplot() +
xlab("Property and Crop damage (billion$)") +
ylab("Severe weather event") +
theme(legend.title = element_blank(), legend.position = "none") +
labs(title = "Total damage in billions over period 1950-2011") +
geom_point(data = top10, aes(y = EVTYPEmapped, x = TotalDamage, size = 1)) +
theme(axis.text.x = element_text( size = 10, hjust = 0.5, vjust = 0.5)) +
theme(plot.title = element_text(colour = "gray0", size = 12, vjust = 1.5))
From the above plot it is very clear that tornado, thunderstorms, floods caused serious negative economic impact. Tornados have caused the most damage, cost amounting to over 3 billion dollars from 1950-2011.
The study of the storm data shows that across the United States, tornados cause the most negative impact on population health and economy.