This report is an analysis of this dataset of storm and weather events, published by the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in order to identify the most significant factors in terms of health and economic impact.
The weather event causing most fatalities in the US between 1995 and 2011 is Excessive Heat, closely followed by Tornado. Tornados cause by far the most injuries of any weather event.
In terms of economic cost, Flood has by far the most signficant impact, then Hurricane/Typhoon, followed by Storm Surge and Tornado.
The fine granularity of weather event categories could have an impact on the results. Further analysis of this is recommended.
filename <- "StormData.csv.bz2"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(filename)) {
download.file(url=url, destfile = filename, method="curl")
}
raw.data <- read.csv(
filename, header=TRUE, na.strings=c(""))
Having downloaded the data and read the file into variable raw.data, let’s look at its structure:
dim(raw.data)
## [1] 902297 37
str(raw.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
In summary, this dataset contains 902297 observations of 37 variables.
How complete is the data? Let’s look for NAs:
nas <- c()
for (i in 1:37) nas[i] <- sum(is.na(raw.data[,i]))
names(nas) <- names(raw.data)
print(nas[nas>0])
## COUNTYNAME BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTYENDN
## 1589 547332 287743 243411 238978 902297
## END_AZI END_LOCATI F PROPDMGEXP CROPDMGEXP WFO
## 724837 499225 843563 465934 618413 142069
## STATEOFFIC ZONENAMES LATITUDE LATITUDE_E REMARKS
## 248769 594029 47 40 287433
There are quite a lot of NAs but not in variables I’m interested in for this report. I can handle Property/Crop damage exponents (PROPDMGEXP and CROPDMGEXP) by treating the associated damage number as an absolute value (i.e. no exponent).
Firstly, let’s remove columns we’re not interested in:
dtc <- raw.data[,
-c(1,3,4,5,6,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
29,30,31,32,33,34,35,36,37)]
The BGN_DATE variable is a factor, so this should be formatted as a Date; we can then create a new variable containing just the year:
date_format <- "%m/%d/%Y %H:%M:%S"
dtc[,"BGN_DATE"] <- as.Date(dtc[,"BGN_DATE"], date_format)
dtc$year <- as.POSIXlt(dtc[,"BGN_DATE"])$year+1900
The EVTYPE variable contains unnecessary whitespace, let’s remove it:
library(stringr)
dtc$EVTYPE <- factor(str_trim(dtc$EVTYPE))
Property and Crop damage values come in a pair of columns, containing a number and optional exponent component, ‘K’ (thousands), ‘M’ (millions) or ‘B’ (billions), respectively. Assessing property damage requires turning the numerical and exponential variables into a purely numerical value that can be used arithmetically. However, the values in the PROPDMGEXP and CROPDMGEXP variables are not limited to the expected values:
table(dtc$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(dtc$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 7 19 1 9 21 281832 1 1994
exp <- c("k", "K", "m", "M", "b", "B", "", NA)
table(dtc$CROPDMGEXP %in% exp)
##
## FALSE TRUE
## 27 902270
table(dtc$PROPDMGEXP %in% exp)
##
## FALSE TRUE
## 321 901976
The majority of the exponent variables contain expected values, but some don’t. I’ll remove these and then use a helper function (makenum) to create new columns (CROPDMGNUM and PROPDMGNUM) with absolute numbers from the raw number and exponent variables (i.e ‘1’, ‘B’ becomes ‘1,000,000,000’):
dtc <- dtc[dtc$PROPDMGEXP %in% exp &
dtc$CROPDMGEXP %in% exp,]
dtc$PROPDMGEXP <- factor(dtc$PROPDMGEXP)
dtc$CROPDMGEXP <- factor(dtc$CROPDMGEXP)
source("functions.R")
dtc$PROPDMGNUM <- mapply(makenum, dtc$PROPDMG, dtc$PROPDMGEXP)
dtc$CROPDMGNUM <- mapply(makenum, dtc$CROPDMG, dtc$CROPDMGEXP)
The source of functions.R can be viewed here.
The processed data now look like this:
str(dtc)
## 'data.frame': 901949 obs. of 12 variables:
## $ BGN_DATE : Date, format: "1950-04-18" "1950-04-18" ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 977 levels "?","ABNORMAL WARMTH",..: 826 826 826 826 826 826 826 826 826 826 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 4 levels "B","K","m","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 5 levels "B","k","K","m",..: NA NA NA NA NA NA NA NA NA NA ...
## $ year : num 1950 1950 1951 1951 1951 ...
## $ PROPDMGNUM: num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ CROPDMGNUM: num 0 0 0 0 0 0 0 0 0 0 ...
head(dtc)
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 1950-04-18 AL TORNADO 0 15 25.0 K 0
## 2 1950-04-18 AL TORNADO 0 0 2.5 K 0
## 3 1951-02-20 AL TORNADO 0 2 25.0 K 0
## 4 1951-06-08 AL TORNADO 0 2 2.5 K 0
## 5 1951-11-15 AL TORNADO 0 2 2.5 K 0
## 6 1951-11-15 AL TORNADO 0 6 2.5 K 0
## CROPDMGEXP year PROPDMGNUM CROPDMGNUM
## 1 <NA> 1950 25000 0
## 2 <NA> 1950 2500 0
## 3 <NA> 1951 25000 0
## 4 <NA> 1951 2500 0
## 5 <NA> 1951 2500 0
## 6 <NA> 1951 2500 0
The following questions were addressed in order to get some familiarity with the data.
yd <- as.data.frame(tapply(dtc$year, dtc$year, length))
yd$year <- row.names(yd)
row.names(yd) <- NULL
names(yd) <- c("count", "year")
library(ggplot2)
library(grid)
ye <- ggplot(yd, aes(x=year, y=count))
ye <- ye + geom_bar(stat="identity")
ye <- ye + xlab("Year") + ylab("Observations")
ye <- ye + ggtitle("Figure 1: Total Observations per Year")
ye <- ye + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(ye)
print(min(dtc$year)); nrow(dtc[dtc$year==min(dtc$year),])
## [1] 1950
## [1] 223
print(max(dtc$year)); nrow(dtc[dtc$year==max(dtc$year),])
## [1] 2011
## [1] 62173
This shows that there were very few events (223) recorded in the earliest year (1950) contained in the dataset, rising slowly during the 70s, 80s and early 90s before suddenly increasing in the mid 90s and then rising more rapidly to a peak of 62173 in 2011
Given the low rate of data collection in the earlier years and the increased relevance (and presumably lower sampling error rate) of later data, I shall limit this analysis to data recorded between 1995 and 2011 (inclusive).
dtc <- dtc[dtc$year >=1995,]
by.evtype <- tapply(dtc$year, dtc$EVTYPE, length)
by.evtype <- as.data.frame(by.evtype[order(by.evtype, decreasing=TRUE)])
names(by.evtype) <- "count"
head(by.evtype)
## count
## HAIL 215898
## TSTM WIND 128927
## THUNDERSTORM WIND 81743
## FLASH FLOOD 52667
## FLOOD 24640
## TORNADO 24320
The most frequently occuring event over the period is Hail, with nearly ten times as many recorded events as Flood and Tornado.
Here I group injuries and fatalities data by event type, summing up the number of each per event type. I then display the top ten of each in a panel plot.
byinjuries <- as.data.frame(tapply(dtc$INJURIES, dtc$EVTYPE, sum))
byinjuries$EVTYPE<-row.names(byinjuries)
byinjuries <- byinjuries[order(byinjuries[1], decreasing=TRUE),]
names(byinjuries) <- c("injuries", "evtype")
row.names(byinjuries) <- NULL
byfatalities <- as.data.frame(tapply(dtc$FATALITIES, dtc$EVTYPE, sum))
byfatalities$EVTYPE<-row.names(byfatalities)
byfatalities <- byfatalities[order(byfatalities[1], decreasing=TRUE),]
names(byfatalities) <- c("fatalities", "evtype")
row.names(byfatalities) <- NULL
library(ggplot2)
library(gridExtra)
gf <- ggplot(byfatalities[1:10,], aes(x=evtype, y=fatalities))
gf <- gf + geom_bar(stat="identity")
gf <- gf + xlab("") + ylab("Fatalities")
gf <- gf + theme(axis.text.x = element_text(angle = 45, hjust = 1))
gi <- ggplot(byinjuries[1:10,], aes(x=evtype, y=injuries))
gi <- gi + geom_bar(stat="identity")
gi <- gi + xlab("Weather Event Type") + ylab("Injuries")
gi <- gi + theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(gf, gi, nrow=2,
main="Figure 2: Health impact of US weather events (1995-2011)")
This figure plots the top 10 causes of fatalities and injuries, with the total number of each in the period 1995-2011. It shows that Excessive Heat causes the most fatalities, followed by Tornado. Tornados cause the most injuries by a significant margin.
Create a new variable TOTALDMG which is the sum of the CROPDMGNUM and PROPDMGNUM variables. Then group the data by event type, recording the sum of TOTALDMG of all observations of that weather event type.
dtc$TOTALDMG <- dtc$CROPDMGNUM + dtc$PROPDMGNUM
bydmg <- as.data.frame(tapply(dtc$TOTALDMG, dtc$EVTYPE, sum))
bydmg$EVTYPE<-row.names(bydmg)
bydmg <- bydmg[order(bydmg[1], decreasing=TRUE),]
names(bydmg) <- c("cost", "evtype")
bn <- 10^9
scale <- signif(max(bydmg$cost, na.rm=TRUE), 1) / bn
library(ggplot2)
gc <- ggplot(bydmg[1:10,], aes(x=factor(evtype), y=cost))
gc <- gc + geom_bar(stat="identity")
gc <- gc + xlab("Event Type") + ylab("Total Cost ($)")
gc <- gc + ggtitle("Figure 3: Total cost of US weather events (1995-2011)")
gc <- gc + scale_y_continuous(
breaks=bn * seq(0,150, by=25),
labels=paste0(seq(0,150, by=25), "bn"),
limits=bn * c(0,155)
)
gc <- gc + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(gc)
This figure shows the total cost of the top ten weather event types. It shows that the most costly type of weather event is Flood, followed by Hurricane/Typhoon, then Storm Surge and Tornado.
The results indicate that the granularity of weather event categorisation could affect the stated impact of certain types of weather event. For instance, perhaps the results would be different if High Wind, Hurricane, Hurricane/Typhoon, Tornado & Tropical Storm were all treated as the same type of event.
This report was produced on the following hardware setup:
| Attribute | Detail |
|---|---|
| OS | OS X Yosemite 10.10.1 |
| Processor | 3.4 GHz Intel Core i7 |
| Memory | 16GB 1333 MHz DDR3 |
| Graphics | AMD Radeon HD 6970M 2048 MB |
The R environment is as follows:
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridExtra_0.9.1 ggplot2_1.0.0 stringr_0.6.2 knitr_1.8
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] gtable_0.1.2 htmltools_0.2.6 labeling_0.3 lattice_0.20-29
## [9] MASS_7.3-35 munsell_0.4.2 plyr_1.8.1 proto_0.3-10
## [13] Rcpp_0.11.3 reshape2_1.4 rmarkdown_0.3.10 scales_0.2.4
## [17] tools_3.1.2 yaml_2.1.13
R Studio version used is:
Version 0.98.1091 – © 2009-2014 RStudio, Inc.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko)