Many weather events have a severe impact on the population of the affected area.
The NOAA Storm Event Database tracks hundreds of thousands of storms and other events,
as well as the resulting damage to property and crops and the number of injuries and fatalities.
This paper analyses the dataset and compares the total impact of weather events
on the population of the US from January 1, 1996 through 2011.
The event types taken into consideration come from the list of weather events
in the documentation of the dataset.
The analysis is split into two categories, economic damage and health damage, and
ranks the danger of events independently in each category.
For this analysis the NOAA Storm Event Database has to be transformed.
Because all event types have only been tracked since 1996, earlier data is ignored to
avoid bias. Since the database contains almost 1000 distinct event labels, they are replaced
with the entries of the official list. Afterwards the information on population health and on property and crop damage is extracted.
The first step of the analysis is to download the NOAA Storm Event Database and
its documentation as raw files from the provided links. The required data is then extracted from the documentation.
#get the NOAA Storm Event Database
stormurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#get the Documentation for the database to extract the official list of events
docurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
#both files are binary, so mode="wb" keeps them intact on every platform
if(!file.exists("stormdata.bz2")){
  download.file(stormurl,"stormdata.bz2", mode="wb")
}
if(!file.exists("NOAAdocumentation.pdf")){
  download.file(docurl,"NOAAdocumentation.pdf", mode="wb")
}
The database can be read without decompressing.
stormdata <- read.csv("stormdata.bz2", na.strings = c("NA",""), encoding = "UTF-8")
The official event list can be extracted from the documentation with a few tricks.
With the pdftools package the text of the documentation can be extracted.
Since the list is found on page 6 in lines 9 to 32, this information can be extracted with
the strsplit function. Finally, to drop whitespace and single characters,
only strings longer than three characters are kept in the list.
require(pdftools)
documentation <- pdf_text("NOAAdocumentation.pdf")
listpage <- unlist(strsplit(documentation[6],"\n"))
#split on runs of two or more spaces so that multi-word event names stay in one piece
splitpage <- trimws(unlist(strsplit(listpage[9:32]," {2,}")))
eventlist <- as.character(splitpage[nchar(splitpage)>3])
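The splitting step can be illustrated on a single made-up line. This is only a sketch: the sample line is hypothetical and assumes that the columns of the table are separated by runs of two or more spaces, which may differ from what pdf_text returns for the actual documentation.

```r
# hypothetical line in the style of the event table: name, designator, name, designator
sample_line <- "Astronomical Low Tide      Z      Lightning                    C"

# split on runs of two or more spaces, then strip leftover whitespace
parts <- trimws(unlist(strsplit(sample_line, " {2,}")))

# keep only strings longer than three characters, which drops the one-letter designators
events <- parts[nchar(parts) > 3]
print(events)
```

Splitting on single spaces instead would break multi-word names such as the first one into separate words, so the separator is the main thing to check against the real page.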
Finally, with approximate string matching from the stringdist package, the recorded
event types can be matched to the official list.
require(stringdist)
stormdata$NORMEVENT <- eventlist[amatch(toupper(stormdata$EVTYPE),toupper(eventlist), maxDist = 11)]
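To show what nearest-match lookup with a distance cutoff does, here is a small sketch that uses base R's adist (Levenshtein distance) instead of stringdist::amatch, so it runs without extra packages. The short official list, the raw labels and the cutoff max_dist are made up for illustration, and amatch uses a different default distance, so the numbers are not comparable to maxDist = 11 above.

```r
# hypothetical excerpt of the official list and some raw labels
official <- c("Flash Flood", "Hail", "Tornado")
raw      <- c("FLASH FLOODING", "HAIL 75", "TORNADOES", "?")

# matrix of edit distances: one row per raw label, one column per official name
d <- adist(toupper(raw), toupper(official))

# index and distance of the closest official name for every raw label
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(raw), best)]

# labels that are further away than the cutoff stay unmatched (NA)
max_dist <- 3
matched  <- ifelse(best_dist <= max_dist, official[best], NA_character_)
print(data.frame(raw, matched, best_dist))
```

A large cutoff matches almost every label to something, which is why a spot check of the matched pairs is worthwhile before trusting the grouped totals.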
After cleaning the dataset, we extract the useful information for the effects on
population health.
require(lubridate)
require(dplyr)
#only a subset of event types (such as tornadoes) was tracked before Jan 1st 1996,
#so these years are excluded to keep the event types comparable
#BGN_DATE is converted to Date so that both sides of the comparison have the same class
evdata <- stormdata[as.Date(mdy_hms(stormdata$BGN_DATE)) >= dmy("1.1.1996"),c("NORMEVENT","FATALITIES","INJURIES")]
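The date filter can be sketched in base R as follows. The sample strings are illustrative and assume that BGN_DATE is stored as month/day/year followed by a time, which is how the values appear to be formatted; the cutoff is written as an explicit Date so that both sides of the comparison have the same class.

```r
# illustrative BGN_DATE values in month/day/year format with a trailing time
bgn <- c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "12/31/2011 0:00:00")

# parse the date part only; the trailing time is ignored by the format
dates <- as.Date(bgn, format = "%m/%d/%Y")

# keep everything from January 1, 1996 onwards
keep <- dates >= as.Date("1996-01-01")
print(data.frame(bgn, dates, keep))
```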
As a last step, the data is grouped by event type and normalised.
healthdamage <- evdata %>% group_by(NORMEVENT) %>% summarise_all(sum) %>% arrange(desc(FATALITIES))
#both categories are normalised for easier comparison
healthdamage$FATALITIES <- healthdamage$FATALITIES / sum(healthdamage$FATALITIES)
healthdamage$INJURIES <- healthdamage$INJURIES / sum(healthdamage$INJURIES)
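The grouping and normalisation can be checked on a toy table. The sketch below uses base R's aggregate instead of dplyr so that it is self-contained; all numbers are invented and only serve to show that each column ends up as a share of its own total.

```r
# invented counts, several rows per event type
toy <- data.frame(
  NORMEVENT  = c("Tornado", "Tornado", "Hail", "Flood"),
  FATALITIES = c(3, 1, 0, 4),
  INJURIES   = c(10, 30, 5, 5)
)

# sum both columns per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ NORMEVENT, data = toy, FUN = sum)

# turn absolute counts into shares of the column total
totals$FATALITIES <- totals$FATALITIES / sum(totals$FATALITIES)
totals$INJURIES   <- totals$INJURIES / sum(totals$INJURIES)
print(totals)
```

After this step each column sums to 1, so fatalities and injuries can be shown on the same percentage axis even though their absolute counts differ widely.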
Next we extract the data for the economic consequences. Here the damage is stored not only as raw values
but also as exponents. The meaning of these exponent variables is poorly documented.
Looking at the data from 1996 onwards, however, only “K”, “M”, “B” and NAs are found. The other exponent values occur mostly in older data and are explained in depth in
this report
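#BGN_DATE is converted to Date so that both sides of the comparison have the same class
ecodata <- stormdata[as.Date(mdy_hms(stormdata$BGN_DATE)) >= dmy("1.1.1996"),
c("NORMEVENT","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]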
#find the absolute damage by multiplying the DMG variables with the factor encoded in the EXP variables
#first we determine which rows need to be multiplied by which factor
#%in% compares element-wise and returns FALSE for NAs in the EXP variables,
#so rows without an exponent are left unchanged
p0sub <- ecodata$PROPDMGEXP %in% "0"
pksub <- ecodata$PROPDMGEXP %in% "K"
pmsub <- ecodata$PROPDMGEXP %in% "M"
pbsub <- ecodata$PROPDMGEXP %in% "B"
cksub <- ecodata$CROPDMGEXP %in% "K"
cmsub <- ecodata$CROPDMGEXP %in% "M"
cbsub <- ecodata$CROPDMGEXP %in% "B"
#then we actually change the values
ecodata$PROPDMG[p0sub] <- ecodata$PROPDMG[p0sub] * 10
ecodata$PROPDMG[pksub] <- ecodata$PROPDMG[pksub] * 1000
ecodata$PROPDMG[pmsub] <- ecodata$PROPDMG[pmsub] * 1000000
ecodata$PROPDMG[pbsub] <- ecodata$PROPDMG[pbsub] * 1000000000
ecodata$CROPDMG[cksub] <- ecodata$CROPDMG[cksub] * 1000
ecodata$CROPDMG[cmsub] <- ecodata$CROPDMG[cmsub] * 1000000
ecodata$CROPDMG[cbsub] <- ecodata$CROPDMG[cbsub] * 1000000000
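The same conversion can also be written with a named lookup vector, which keeps the mapping from exponent code to factor in one place. This is a sketch on invented values; the factors mirror the ones used above (including 10 for the code "0"), and treating a missing exponent as a factor of 1 is an assumption.

```r
# factor for every exponent code handled in this analysis
multiplier <- c("0" = 10, K = 1e3, M = 1e6, B = 1e9)

# invented damage values and exponent codes, one of them missing
dmg      <- c(2.5, 10, 1, 7)
exp_code <- c("K", "M", NA, "B")

# look up the factor per row; as.character guards against factor columns
scale <- unname(multiplier[as.character(exp_code)])

# assumption: rows without a known exponent keep their raw value
scale[is.na(scale)] <- 1

absolute <- dmg * scale
print(absolute)
```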
As a last step, the data is grouped by event type and normalised.
absprdmg <- sum(ecodata$PROPDMG)
abscrdmg <- sum(ecodata$CROPDMG)
#both values are now normalised for easier comparison
propdamage <- ecodata %>% group_by(NORMEVENT) %>% summarise(propdamage = sum(PROPDMG)/absprdmg,cropdamage = sum(CROPDMG)/abscrdmg)
In this section the results of the analysis are shown, and it is discussed which
events have the biggest impact on the population of the US.
To showcase the impact of weather events on the population, the data is cropped to ignore events with very little effect. For plotting, the events are ordered by their share of fatalities.
#subset the data so only events that account for more than 0.5% (0.005) of all fatalities or injuries are shown
require(tidyr)
require(ggplot2)
require(scales)
healthdamage <- subset(healthdamage, FATALITIES>0.005|INJURIES>0.005)
showdamage <- gather(healthdamage,"type","amount",-NORMEVENT)
#columns are referenced by name inside aes(); bars are ordered by the share of fatalities
p<-ggplot(data=showdamage, aes(x=reorder(NORMEVENT,amount*(type == "FATALITIES")), y=amount, fill = type)) +
geom_bar(stat="identity", position=position_dodge(.5))+
theme_minimal()+
labs(title = "Effect of different Weather Events on Population health",
caption = "Data source: NOAA Storm Event Database")+
scale_x_discrete(name = "Type of Weather event")+
scale_y_continuous(name = "relative impact on population health",
labels = percent)
# Horizontal bar plot
p + coord_flip()
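The cropping, reshaping and ordering that feed the plot can be followed on a toy table. The sketch below uses base R only; the event names A, B, C and their shares are invented and are not results of this analysis.

```r
# invented shares per event type
shares <- data.frame(
  NORMEVENT  = c("A", "B", "C"),
  FATALITIES = c(0.600, 0.398, 0.002),
  INJURIES   = c(0.500, 0.499, 0.001)
)

# keep events that exceed 0.5% in at least one of the two categories
kept <- subset(shares, FATALITIES > 0.005 | INJURIES > 0.005)

# reshape to long format: one row per event and category
long <- data.frame(
  NORMEVENT = rep(kept$NORMEVENT, times = 2),
  type      = rep(c("FATALITIES", "INJURIES"), each = nrow(kept)),
  amount    = c(kept$FATALITIES, kept$INJURIES)
)

# order the event levels by fatality share; injury rows contribute zero to the score
long$NORMEVENT <- reorder(long$NORMEVENT, long$amount * (long$type == "FATALITIES"))
print(levels(long$NORMEVENT))
```

Because reorder sorts in ascending order and the plot is flipped afterwards, the event with the largest fatality share ends up at the top of the horizontal bar chart.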
Looking at the data, several surprising things can be seen. Excessive heat, even though
it does not seem that dangerous, killed the most people in the US, closely followed by tornadoes,
which not only kill a lot of people but also cause more than a third of all
injuries from weather-related events. Other very harmful events such as flash floods,
lightning and rip currents seem to have an element of surprise in common,
which often leaves little to no time to prepare. The obvious exception is excessive heat,
which could nevertheless feel sudden for people unexpectedly stuck in the heat without water supply.
For example, when a car breaks down somewhere in Death Valley, the passengers are often unprepared.
This, however, is just an attempt to interpret the data and has to be validated against the database.
Similar to the impact on population health, the data here is cropped and sorted.
#subset the data so only events that account for more than 0.5% (0.005) of total property or crop damage are shown
filtereddamage <- subset(propdamage, propdamage>0.005|cropdamage>0.005)
showdamage <- gather(filtereddamage,"type","amount",-NORMEVENT)
#columns are referenced by name inside aes(); bars are ordered by the share of property damage
p<-ggplot(data=showdamage, aes(x=reorder(NORMEVENT,amount*(type == "propdamage")), y=amount, fill = type)) +
geom_bar(stat="identity", position=position_dodge(.5))+
theme_minimal()+
labs(title = "Effect of different Weather Events on property and crops",
caption = "Data source: NOAA Storm Event Database")+
scale_x_discrete(name = "Type of Weather event")+
scale_y_continuous(name = "relative damage",
labels = percent)
# Horizontal bar plot
p + coord_flip()
Compared to the effects on population health, a smaller selection of events seems to cause great
damage to crops and property.
The leading causes of property damage, high winds, flash floods and tornadoes, are not very surprising,
as few precautions can be taken to protect many kinds of property from these events. With almost 40 percent of total crop damage, hail stands out clearly from the other events.
This also seems plausible, as nothing can be done to protect crops from hail.
Even though the data processing step used automated approximate string matching,
a method which is prone to errors, the results seem plausible and can be explained easily. To prove these claims, additional analysis has to be conducted on the database.