In this document, the Storm Data collected by NOAA (https://www.ncdc.noaa.gov/stormevents/details.jsp) is analysed, to answer two questions. 1.Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2.Across the United States, which types of events have the greatest economic consequences? To answer these questions, the event types in the NOAA database were matched to 48 standard events. The economic damage was adjusted for inflation. Finally, because the time over which different events were collected was significantly different, the effect of each event per year was calculated. It was shown that tornados have the biggest influence on population health, while floods have the biggest economic impact.
First, the required libraries are loaded:
library("data.table")
library("R.utils")
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## Die folgenden Objekte sind maskiert von 'package:methods':
##
## getClasses, getMethods
##
## Die folgenden Objekte sind maskiert von 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## Das folgende Objekt ist maskiert 'package:utils':
##
## timestamp
##
## Die folgenden Objekte sind maskiert von 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library("ggplot2")
library("dplyr")
##
## Attaching package: 'dplyr'
##
## Die folgenden Objekte sind maskiert von 'package:data.table':
##
## between, last
##
## Die folgenden Objekte sind maskiert von 'package:stats':
##
## filter, lag
##
## Die folgenden Objekte sind maskiert von 'package:base':
##
## intersect, setdiff, setequal, union
library("stringdist")
library("cowplot")
##
## Attaching package: 'cowplot'
##
## Das folgende Objekt ist maskiert 'package:ggplot2':
##
## ggsave
Next, the data is downloaded, unzipped and read in. The fread-function from the data.table package is used, because it provides fast performance. It requires unzipped data, which is done using the R.utils package’s gunzip:
if(!file.exists("repdata_data_StormData.csv.bz2")) download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "repdata_data_StormData.csv.bz2")
if(!file.exists("repdata_data_StormData.csv")) gunzip(filename="repdata_data_StormData.csv.bz2",destname="repdata_data_StormData.csv")
stormdata <- fread("repdata_data_StormData.csv")
##
Read 10.3% of 967216 rows
Read 30.0% of 967216 rows
Read 47.6% of 967216 rows
Read 58.9% of 967216 rows
Read 76.5% of 967216 rows
Read 81.7% of 967216 rows
Read 89.9% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:09
## Warning in fread("repdata_data_StormData.csv"): Read less rows (902297)
## than were allocated (967216). Run again with verbose=TRUE and please
## report.
Some oberservations contain neither damage to population health nor financial damage. To remove them, the damage categories are summed and those equal to zero deleted:
stormdata <- mutate(stormdata, totaldamage = FATALITIES + INJURIES + PROPDMG + CROPDMG)
rowsbefore <- nrow(stormdata)
stormdata <- filter(stormdata, totaldamage != 0)
rowsafter <- nrow(stormdata)
reductionpercentage <- round(1-(rowsafter/rowsbefore),digits = 3)*100
The number of observations could be reduced by 71.8% from 902297 rows to 254633 rows.
Next, the EVTYPE oberservation from the data is matched to the 48 event types from https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. These were copied and pasted into a CSV file called eventtypes.csv, as it is very difficult to programmatically extract this information from the pdf in a reproducible way. To make this research reproducible, the resulting list of 48 storm types is printed out in this document:
eventtypes <- read.csv("eventtypes.csv", header=FALSE, stringsAsFactors = FALSE)
eventtypes$V1 <- substr(eventtypes$V1,1,nchar(eventtypes$V1)-2)
eventtypes
## V1
## 1 Astronomical Low Tide
## 2 Avalanche
## 3 Blizzard
## 4 Coastal Flood
## 5 Cold/Wind Chill
## 6 Debris Flow
## 7 Dense Fog
## 8 Dense Smoke
## 9 Drought
## 10 Dust Devil
## 11 Dust Storm
## 12 Excessive Heat
## 13 Extreme Cold/Wind Chill
## 14 Flash Flood
## 15 Flood
## 16 Frost/Freeze
## 17 Funnel Cloud
## 18 Freezing Fog
## 19 Hail
## 20 Heat
## 21 Heavy Rain
## 22 Heavy Snow
## 23 High Surf
## 24 High Wind
## 25 Hurricane (Typhoon)
## 26 Ice Storm
## 27 Lake-Effect Snow
## 28 Lakeshore Flood
## 29 Lightning
## 30 Marine Hail
## 31 Marine High Wind
## 32 Marine Strong Wind
## 33 Marine Thunderstorm Wind
## 34 Rip Current
## 35 Seiche
## 36 Sleet
## 37 Storm Surge/Tide
## 38 Strong Wind
## 39 Thunderstorm Wind
## 40 Tornado
## 41 Tropical Depression
## 42 Tropical Storm
## 43 Tsunami
## 44 Volcanic Ash
## 45 Waterspout
## 46 Wildfire
## 47 Winter Storm
## 48 Winter Weather
Then the EVTYPs are matched to the new list of 48 event types. First, the unique EVTYPs are extracted, then the minimum Levenshtein distance is used to match them:
uniqueEVTYPE <- as.data.frame(unique(stormdata$EVTYPE), stringsAsFactors = FALSE)
for(i in 1:nrow(uniqueEVTYPE)) {
uniqueEVTYPE[i,2] <- eventtypes[amatch(uniqueEVTYPE[i,1],eventtypes$V1,maxDist = Inf),1]
}
colnames(uniqueEVTYPE) <- c("find","replace")
A quick check shows that many events are now coded correctly, but there are flaws in this technique. To improve the results, a few manual corrections are applied:
for(i in 1:nrow(uniqueEVTYPE)) {
uniqueEVTYPE[i,1] <- gsub("\\(|\\)", "", uniqueEVTYPE[i,1]) ## remove brackets
if (grepl("tornado",uniqueEVTYPE[i,1],ignore.case = TRUE))uniqueEVTYPE[i,2] <- eventtypes[40,1]
if (grepl("thunderstorm",uniqueEVTYPE[i,1],ignore.case = TRUE))uniqueEVTYPE[i,2] <- eventtypes[39,1]
if (grepl("flood",uniqueEVTYPE[i,1],ignore.case = TRUE))uniqueEVTYPE[i,2] <- eventtypes[15,1]
if (grepl("tstm",uniqueEVTYPE[i,1],ignore.case = TRUE))uniqueEVTYPE[i,2] <- eventtypes[39,1]
if (grepl("hurricane",uniqueEVTYPE[i,1],ignore.case = TRUE))uniqueEVTYPE[i,2] <- eventtypes[25,1]
if (grepl("typhoon",uniqueEVTYPE[i,1],ignore.case = TRUE))uniqueEVTYPE[i,2] <- eventtypes[25,1]
}
The resulting list is still not perfect, but it can be used to correct most coding errors in the original EVTYPs variable:
stormdata$EVTYPE <- with(uniqueEVTYPE,replace[match(stormdata$EVTYPE,find)])
Next, to determine the financial damage, the actual value has to be multiplied with the multiplier. According to the codebook, K, M and B equal Thousands, Millions and Billions. All other values are defaulted to 1000, because they are undefined.
stormdata <- mutate(stormdata, PROPDMGEXP = ifelse(PROPDMGEXP=="K",1E3,
ifelse(PROPDMGEXP=="M",1E6,
ifelse(PROPDMGEXP=="m",1E6,
ifelse(PROPDMGEXP=="B",1E9,1E3)))),
CROPDMGEXP = ifelse(CROPDMGEXP=="K",1E3,
ifelse(CROPDMGEXP=="M",1E6,
ifelse(CROPDMGEXP=="m",1E6,
ifelse(CROPDMGEXP=="B",1E9,1E3)))),
CROPDMG = CROPDMG * CROPDMGEXP, PROPDMG = PROPDMG * PROPDMGEXP)
To adjust for inflation, a solution which was originally proposed on http://stackoverflow.com/questions/12590180/inflation-adjusted-prices-package is used. It downloads a table of inflation rates and calculates an inflation factor for each year. They are then applied to CROPDMG and PROPDMG.
monthly_cpi <- read.csv("http://research.stlouisfed.org/fred2/data/CPIAUCSL.csv", header = TRUE)
monthly_cpi$cpi_year <- year(monthly_cpi$DATE)
yearly_cpi <- monthly_cpi %>% group_by(cpi_year) %>% summarize(cpi = mean(VALUE))
yearly_cpi$adj_factor <- yearly_cpi$cpi/yearly_cpi$cpi[yearly_cpi$cpi_year == 2015]
stormdata <- mutate(stormdata, BGN_DATE = as.POSIXct(BGN_DATE, format="%m/%d/%Y %H:%M:%S"))
stormdata <- mutate(stormdata, PROPDMG = PROPDMG / yearly_cpi$adj_factor[match(as.integer(format(stormdata$BGN_DATE, format="%Y")),yearly_cpi$cpi_year)], CROPDMG = CROPDMG / yearly_cpi$adj_factor[match(as.integer(format(stormdata$BGN_DATE, format="%Y")),yearly_cpi$cpi_year)])
As for damage to people, the database contains approximately 10 times more injuries than fatalities:
sum(stormdata$INJURIES)
## [1] 140528
sum(stormdata$FATALITIES)
## [1] 15145
It is therefore assumed that fatalities should weigh ten times more than injuries and a new variable called HEALTHDMG, which is the sum of all inuries and ten times the fatalities:
stormdata <- mutate(stormdata, HEALTHDMG=INJURIES + (FATALITIES*10))
All required data is now contained in the columns BGN_DATE, EVTYP, TOTALDMG and HEALTHDMG, so that the rest of the data can be removed:
stormdata <- select(stormdata, BGN_DATE, EVTYPE, CROPDMG, PROPDMG, HEALTHDMG)
The resulting data is grouped by EVTYPE toanswer the given questions:
stormdata <- group_by(stormdata, EVTYPE)
Finally, to prepare plotting the data, the health damage related data is extracted:
healthdmg <- summarise(stormdata, HEALTHDMG = sum(HEALTHDMG))
healthdmg <- healthdmg[order(-HEALTHDMG)]
healthdmg <- healthdmg[1:10,]
ecodamage <- summarise(stormdata, CROPDMG = sum(CROPDMG)/1E9, PROPDMG=sum(PROPDMG)/1E9)
ecodamage <- mutate(ecodamage, TOTALDMG = CROPDMG+PROPDMG)
ecodamage <- ecodamage[order(-TOTALDMG)]
ecodamage <- ecodamage[1:10,]
ecodamage <- select(ecodamage, EVTYPE, CROPDMG, PROPDMG)
ecodamage <- melt(ecodamage, id.vars = "EVTYPE")
The following graph shows the Top Ten storm events most harmful to the economy.
ECODMGPLOT <- ggplot(ecodamage,aes(x=reorder(EVTYPE,-value),y=value,fill=variable)) +
geom_bar(stat="identity") +
labs(x="Event Type",y="Economical damage in billions USD", title="Top Ten Storm Events harming the economy") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_fill_discrete(name="Type of Damage",
breaks=c("CROPDMG", "PROPDMG"),
labels=c("Crop Damage", "Property Damage"))
print(ECODMGPLOT)
One disadvantage of this analysis is, that some events have been recorded for a longer time than others. As can be seen on “https://www.ncdc.noaa.gov/stormevents/details.jsp”, Tornados have been recorded since 1950, Thunderstorms and Hail since 1955, and all other events since 1996. To compensate for this effect the following figure shows plots similar to above, but this time the impact PER YEAR is plotted, so the data will be more comparable. Furthermore, a similar plot for population health effects is created:
healthdmg <- summarise(stormdata, HEALTHDMG = sum(HEALTHDMG))
healthdmg <- mutate(healthdmg, DMGPERYEAR = ifelse(EVTYPE == eventtypes$V1[19] |EVTYPE == eventtypes$V1[39], HEALTHDMG/57,
ifelse(EVTYPE == eventtypes$V1[40],HEALTHDMG/62, HEALTHDMG/16)))
healthdmg <- healthdmg[order(-DMGPERYEAR)]
healthdmg <- healthdmg[1:10,]
HEALTHDMGPERYEAR <- ggplot(healthdmg,aes(x=reorder(EVTYPE,-DMGPERYEAR),y=DMGPERYEAR)) +
geom_bar(stat="identity", color="white", fill="deepskyblue3") +
labs(title="Health damage",y="Yearly damage to population health", x="") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
ecodamage <- summarise(stormdata, CROPDMG = sum(CROPDMG)/1E9, PROPDMG=sum(PROPDMG)/1E9)
ecodamage <- mutate(ecodamage, TOTALDMG = CROPDMG+PROPDMG)
ecodamage <- mutate(ecodamage, CROPPERYEAR = ifelse(EVTYPE == eventtypes$V1[19] |EVTYPE == eventtypes$V1[39], CROPDMG/57,
ifelse(EVTYPE == eventtypes$V1[40],CROPDMG/62, CROPDMG/16)))
ecodamage <- mutate(ecodamage, PROPPERYEAR = ifelse(EVTYPE == eventtypes$V1[19] |EVTYPE == eventtypes$V1[39], PROPDMG/57,
ifelse(EVTYPE == eventtypes$V1[40],PROPDMG/62, PROPDMG/16)))
ecodamage <- mutate(ecodamage, DMGPERYEAR = ifelse(EVTYPE == eventtypes$V1[19] |EVTYPE == eventtypes$V1[39], TOTALDMG/57,
ifelse(EVTYPE == eventtypes$V1[40],TOTALDMG/62, TOTALDMG/16)))
ecodamage <- ecodamage[order(-DMGPERYEAR)]
ecodamage <- ecodamage[1:10,]
ecodamage <- select(ecodamage, EVTYPE, CROPPERYEAR, PROPPERYEAR)
ecodamage <- melt(ecodamage, id.vars = "EVTYPE")
ECODMGPLOTPERYEAR <- ggplot(ecodamage,aes(x=reorder(EVTYPE,-value),y=value,fill=variable)) +
geom_bar(stat="identity") +
labs(y="Yearly Economical damage (billions USD)",title="Economic Damage", x="") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_fill_discrete(name="Type of Damage",
breaks=c("CROPPERYEAR", "PROPPERYEAR"),
labels=c("Crop Damage", "Property Damage"))
plot_grid(ECODMGPLOTPERYEAR,HEALTHDMGPERYEAR,nrow=2,ncol=1)
It shows, that while the Top Event (flood for economic impact) remains the same, there are significant changes on the follow up events. For example, while tornados were the number two event overall for economic impact, they are only number four when analysing the yearly impact. Please note, that the units for health impact respresent 1 injury or 1/10th fatalities, which means that fatalities count ten-fold.
As a conclusion, the most harmful events are Tornados for population health and floods for economic impact.
It has to be noted though that this interpretation of yearly values does not cater for changes in the quality of data collection over the years nor for climatic changes which might have changed event frequencies.
This document was created in the following R environment:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_0.5.0 stringdist_0.9.3 dplyr_0.4.3 ggplot2_1.0.1
## [5] R.utils_2.1.0 R.oo_1.19.0 R.methodsS3_1.7.0 data.table_1.9.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.11 magrittr_1.5 MASS_7.3-43
## [5] munsell_0.4.2 colorspace_1.2-6 R6_2.1.1 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.2.2 parallel_3.2.2 grid_3.2.2
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 lazyeval_0.1.10
## [17] assertthat_0.1 yaml_2.1.13 digest_0.6.8 reshape2_1.4.1
## [21] formatR_1.2.1 evaluate_0.8 rmarkdown_0.8.1 labeling_0.3
## [25] stringi_0.5-5 scales_0.3.0 chron_2.3-47 proto_0.3-10