Prepared by: Colin Pistell
In this report we examine the public heath and economic impact of severe weather events since 1950. We used data from the NOAA Storm Database to determine that tornadoes have had the greatest impact on public health while flooding has had the greatest impact on the economy. We then examine what locations around the country are most likely to be affected by tornadoes and flooding and determine that Texas, Florida and much of the South and Mid-west should take steps to be well prepared for the impact of tornadoes while the mid-west should be more prepared for flooding.
First, we load some required packages and take a snapshot of our Session Info:
require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require("ggplot2")
## Loading required package: ggplot2
require("reshape2")
## Loading required package: reshape2
require("maps")
## Loading required package: maps
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] maps_2.3-9 reshape2_1.4 ggplot2_1.0.0 dplyr_0.3.0.2
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4
## [5] evaluate_0.5.5 formatR_1.0 grid_3.1.1 gtable_0.1.2
## [9] htmltools_0.2.6 knitr_1.7 magrittr_1.0.1 MASS_7.3-33
## [13] munsell_0.4.2 parallel_3.1.1 plyr_1.8.1 proto_0.3-10
## [17] Rcpp_0.11.3 rmarkdown_0.3.8 scales_0.2.4 stringr_0.6.2
## [21] tools_3.1.1 yaml_2.1.13
Next, we load the dataset into R. The dataset must be downloaded from here (47megs) and unzipped into our working directory.
stormdata <- read.csv("repdata-data-StormData.csv")
Now we can get some basic info about the dataset:
dim(stormdata)
## [1] 902297 37
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We can see that there are 37 variables and 902297 rows. However, we only care about variables associated with Event Type, Casualty and Economic damage info, and perhaps State information. So we can filter our dataset accordingly:
strm <- select(stormdata, EVTYPE, STATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% group_by(EVTYPE)
Next we want to look at the recorded Event Types to see if we can eliminate some. First we summarize the data by how many of each Event appears:
tot <- summarize(strm, count = n()) %>% arrange(desc(count))
head(tot)
## Source: local data frame [6 x 2]
##
## EVTYPE count
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
tail (tot)
## Source: local data frame [6 x 2]
##
## EVTYPE count
## 1 WIND/HAIL 1
## 2 WINTER STORM HIGH WINDS 1
## 3 WINTER STORM/HIGH WIND 1
## 4 WINTER STORM/HIGH WINDS 1
## 5 Wintry Mix 1
## 6 WND 1
table(tot$count)
##
## 1 2 3 4 5 6 7 8 9 10
## 489 140 60 43 28 23 13 14 7 9
## 11 12 13 14 15 16 17 18 19 20
## 6 4 5 3 1 2 3 2 5 3
## 21 22 23 24 25 26 27 28 29 30
## 3 3 5 2 2 1 1 1 1 2
## 32 33 34 36 37 38 39 43 45 46
## 3 1 1 2 1 1 1 1 2 1
## 47 48 51 53 56 59 60 61 64 72
## 1 2 1 2 1 1 1 2 1 1
## 74 81 84 87 88 90 98 101 103 120
## 2 1 1 1 1 1 1 1 1 1
## 126 135 141 143 146 148 154 173 174 186
## 1 1 1 1 1 1 1 1 2 1
## 196 204 228 249 250 261 304 340 386 427
## 1 1 1 1 1 1 1 1 1 1
## 442 470 538 539 587 600 624 636 650 655
## 1 1 1 1 1 1 1 1 1 1
## 682 690 725 767 1002 1028 1104 1293 1342 1457
## 1 1 1 1 1 1 1 1 1 1
## 1533 1678 2006 2488 2719 2761 3392 3566 3796 5812
## 1 1 1 1 1 1 1 1 1 1
## 6175 6839 7026 11433 11723 15708 15754 20212 20843 25326
## 1 1 1 1 1 1 1 1 1 1
## 54277 60652 82563 219940 288661
## 1 1 1 1 1
We can see that nearly 500 Event Types only have 1 entry! To keep the data tidy we’ll eliminate any Event Type that has fewer than 100 entries:
tot2 <- filter(tot, count >= 100) %>% arrange(desc(count))
head(tot2)
## Source: local data frame [6 x 2]
##
## EVTYPE count
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
nrow(tot2)/nrow(tot) #Checking % of original rows left
## [1] 0.07005076
sum(tot2$count)/sum(tot$count) #Checking what % of information we retained
## [1] 0.9947756
As seen above, by retaining only 7% of the Event Types we still manage to capture over 99% of the records - and we will make our lives much easier down the road. Let’s filter our dataset accordingly and group by Event Type:
strm <- filter(strm, EVTYPE %in% tot2$EVTYPE) %>% group_by(EVTYPE)
Now we need to deal with the Damage multiplyers. First, we’ll remove any row with odd or nonsensical symbols, then convert the remaining symbols into numbers and finally calculate the true economic impact.
## Filter out weird symbols
table(strm$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 462659 1 8 3 214 25 13 4 4 27
## 6 7 8 B h H K m M
## 4 5 1 22 0 6 423429 6 11152
table(strm$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 614249 6 18 1 7 20 281379 0 1903
strm2 <- filter(strm, CROPDMGEXP %in% c("", "B", "k", "K", "m", "M"))
strm2 <- filter(strm2, PROPDMGEXP %in% c("", "B", "H", "K", "m", "M"))
strm2$CROPDMGEXP <- tolower(strm2$CROPDMGEXP)
strm2$PROPDMGEXP <- tolower(strm2$PROPDMGEXP)
## Convert Crop multiplyer
x <- as.character(strm2$CROPDMGEXP)
x <- as.factor(x)
levels(x)
## [1] "" "b" "k" "m"
levels(x) <- c(1, 1000000000, 1000, 1000000)
x <- as.numeric(as.character(x))
strm2$CROPDMGEXP <- x
## Convert Prop multiplyer
x <- as.character(strm2$PROPDMGEXP)
x <- as.factor(x)
levels(x)
## [1] "" "b" "h" "k" "m"
levels(x) <- c(1, 1000000000, 100, 1000, 1000000)
x <- as.numeric(as.character(x))
strm2$PROPDMGEXP <- x
## Clean up
rm(x)
strm2$propertydamage <- strm2$PROPDMG * strm2$PROPDMGEXP
strm2$cropdamage <- strm2$CROPDMG * strm2$CROPDMGEXP
Finally, we can summarize our data into useful information for answering our questions:
dat <- summarize(strm2, Deaths = sum(FATALITIES), Injuries = sum(INJURIES), PropertyDmg = sum(propertydamage), CropDmg = sum(cropdamage))
colnames(dat)[1] <- "Event" #Give our Event Types a more human readable name
First, we’ll look at the public health impact of severe weather events. We’ll look at the top 10 causes of death and injury and see how many event types they have in common:
deaths <- arrange(dat, desc(Deaths)) %>% select(Event, Deaths) %>% head(n = 10)
injuries <- arrange(dat, desc(Injuries)) %>% select(Event, Injuries) %>% head(n = 10)
deaths
## Source: local data frame [10 x 2]
##
## Event Deaths
## 1 TORNADO 5630
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 246
## 10 AVALANCHE 224
injuries
## Source: local data frame [10 x 2]
##
## Event Injuries
## 1 TORNADO 91321
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1360
intersect(deaths$Event, injuries$Event) #How many in common?
## [1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT"
## [5] "LIGHTNING" "TSTM WIND" "FLOOD"
We can see that both lists share most event types and that tornadoes are the worst offender for both deaths and injuries. We’ll use the fatalities list as it is the most severe and create a graph to visualize the relative impact of each event type. First we recombine the data into a useful form for graphing:
cas <- arrange(dat, desc(Deaths), desc(Injuries)) %>% select(Event, Injuries, Deaths) %>% head(n = 10) %>% melt(variable.name = "type", value.name = "casualties")
## Using Event as id variables
And then we create the plot:
p <- ggplot(cas, aes(x = reorder(Event, casualties), y = casualties))
p + geom_bar(stat = "identity", colour = "black", fill = "lightblue") + facet_grid(type ~ ., scales = "free_y") + labs(x = "Event Type", y = "Casualties") + ggtitle("Public Health Impact of Severe Weather Events")
Now we will examine the economic impact of severe weather events. We’ll examine both property damage and crop damage, but since we care about overall economic impact we will be most interested in the combination of both values.
property <- arrange(dat, desc(PropertyDmg)) %>% select(Event, PropertyDmg) %>% head(n = 10)
crop <- arrange(dat, desc(CropDmg)) %>% select(Event, CropDmg) %>% head(n = 10)
property
## Source: local data frame [10 x 2]
##
## Event PropertyDmg
## 1 FLOOD 144657709807
## 2 TORNADO 56936985483
## 3 STORM SURGE 43323536000
## 4 FLASH FLOOD 16140811717
## 5 HAIL 15732262277
## 6 HURRICANE 11868319010
## 7 TROPICAL STORM 7703890550
## 8 WINTER STORM 6688497250
## 9 HIGH WIND 5270046260
## 10 RIVER FLOOD 5118945500
crop
## Source: local data frame [10 x 2]
##
## Event CropDmg
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022110000
## 5 HAIL 3000954453
## 6 HURRICANE 2741910000
## 7 FLASH FLOOD 1420727100
## 8 EXTREME COLD 1292973000
## 9 FROST/FREEZE 1094086000
## 10 HEAVY RAIN 733399800
dat$TotalDmg <- dat$PropertyDmg + dat$CropDmg #Create a new column for total economic impact
total <- arrange(dat, desc(TotalDmg)) %>% select(Event, TotalDmg) %>% head(n = 10)
total
## Source: local data frame [10 x 2]
##
## Event TotalDmg
## 1 FLOOD 150319678257
## 2 TORNADO 57301935593
## 3 STORM SURGE 43323541000
## 4 HAIL 18733216730
## 5 FLASH FLOOD 17561538817
## 6 DROUGHT 15018672000
## 7 HURRICANE 14610229010
## 8 RIVER FLOOD 10148404500
## 9 ICE STORM 8967037810
## 10 TROPICAL STORM 8382236550
Here, we see that flooding has the greatest impact, followed by Tornadoes. We can once again creat a graph to visualize the impact of each event. First, we recombine the data into a suitable form:
econ <- arrange(dat, desc(PropertyDmg), desc(CropDmg)) %>% select(Event, PropertyDmg, CropDmg) %>% head(n = 10) %>% melt(variable.name = "type", value.name = "cost")
## Using Event as id variables
And then plot the graph:
q <- ggplot(econ, aes(x = reorder(Event, cost), y = cost, fill = type, order = desc(type)))
q + geom_bar(stat = "identity", colour = "black") + labs(x = "Event Type", y = "Total Cost") + scale_fill_brewer(palette = "Blues") + ggtitle("Economic Impact of Severe Weather Events") + theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
The economic impact of flooding is clear. Now that we know the impact of tornadoes and flooding, perhaps it would be a good idea to get a better understanding of what regions of the country are most at risk for these events. To do this, first we reorganize our data to filter only tornado and flood events and group it by State. Then we can summarize to find out how many of each event each state has experienced:
tornado <- filter(strm2, EVTYPE == "TORNADO" | EVTYPE == "FLOOD") %>% select(STATE, EVTYPE) %>% group_by(STATE, EVTYPE) %>% summarize(Count = n())
head(tornado)
## Source: local data frame [6 x 3]
## Groups: STATE
##
## STATE EVTYPE Count
## 1 AK FLOOD 208
## 2 AK TORNADO 3
## 3 AL FLOOD 170
## 4 AL TORNADO 2101
## 5 AR FLOOD 525
## 6 AR TORNADO 1987
Next we create a vector of State abbreviations to use when merging datasets:
orderedstates <- c("AL", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
We’ll only be looking at the lower 48 states, so we must filter our tornado dataset appropriately
tornado <- filter(tornado, STATE %in% orderedstates)
Next we load in map data, plug in our State abbreviations and then merge the datasets.
map <- map_data("state")
map$region <- as.factor(map$region)
levels(map$region) <- orderedstates
## Merge the two datasets
tormap <- merge(tornado, map, by.x = "STATE", by.y = "region")
tormap <- arrange(tormap, group, order) #Must resort so ggplot will draw the polygons appropriately
Finally, we can create the plot:
## Create a clean theme
theme_clean <- function(base_size = 12) {
require(grid)
theme_grey(base_size) %+replace%
theme(
axis.title = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks.length = unit(0, "cm"),
axis.ticks.margin = unit(0, "cm"),
panel.margin = unit(0, "lines"),
plot.margin = unit(c(0, 0, 0, 0), "lines"),
complete = TRUE
)
}
## Create the plot
m <- ggplot(tormap, aes(x = long, y = lat, group = group, fill = Count))
m + geom_polygon(colour = "black") + scale_fill_gradient2(low = "#559999", mid = "grey90", high = "#D55E00", midpoint = median(tormap$Count)) + expand_limits(x = tormap$long, y = tormap$lat) + coord_map("mercator") + facet_grid(EVTYPE ~ .) + theme_clean() + ggtitle("Severe Event Locations")
## Loading required package: grid
We can see that Texas along with most of the South and Mid-West should be prepared to deal with the impact of tornadoes while the midwest should be more prepared for flooding.