StoRm Data Analysis

Overview

Hi, thanks for reading my strategy for analysing the National Weather Service’s Storm Data disaster dataset. This dataset is an aggregate of weather events across the United States, since 1950. It describes location, time, type and damage associated with an adverse weather event. We are going to use it for total dollar damage and total casualties.

This dataset is quite large and manipulating it will take some CPU time. Since I’ve had some trouble working with cache = TRUE. I’ve used saveRDS() and readRDS to help shorten my execution time knitting.

Also I use the %>%, and %<>% “pipe” operators from library(magrittr) pervasively thought my code. These are just simple short hand used like this:

# identical results
split(mtcars, mtcars$cyl)
mtcars %>% split(.$cyl)
# %>% takes the LHS argument and passes it as the first argument to RHS
# the . notation represents the element being passed in from the LHS
mtcars %<>% split(.$cyl)
# reciprocal pipe stores the value of RHS evaluation, overwriting the original LHS variable

Data Processing:

First, we need to read in our 47mB zipped data file as a data.frame. Fortunately read.csv allows us to do this directly without unpacking the file first.

setwd("~/datasciencecoursera/Reproducible_Research/")
datf <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")

The column datf$EVTYPE describes the type of adverse weather event causing damage. If we call table(daft$EVTYPE) we can see we have A LOT of different events logged. To do a good analysis we want to clean this up a bit. I started by converting to lower case to eliminate any differences due to capitalization, then checking out the percentage of the data set stored in the 15 most frequent results from table(datf$EVTYPE):

datf$EVTYPE %<>% tolower()
table(datf$EVTYPE) %>% sort(decreasing = T) %>% .[1:15] %>% sum() / nrow(datf)

## [1] 0.9389159

So right away we have ~94% of our dataset represented in the top 15 tags, that pretty good and probably good enough to get a good idea of what type of weather events are the most destructive. But I went a little overboard with my clean up and tried to collapse common tag variations into single tag names, uniting events like “tropical storm” and “hurricane” into a single tag, “tropical_storm”. I did this manually with gsub() and referencing table(datf$EVTYPE) %>% sort(decreasing = T) %>% .[1:50] frequently.

# collapse tags by hand, recycling the above table call
# flood
datf$EVTYPE %<>% gsub(".*flood.*|.*fld.*", "flood", .)
# hail
datf$EVTYPE %<>% gsub(".*hail.*", "hail", .)
# fire
datf$EVTYPE %<>% gsub(".*fire.*", "fire", .)
# wind
datf$EVTYPE %<>% gsub(".*wind.*", "wind", .)
# tropical_storm
datf$EVTYPE %<>% gsub(".*tropical.*|.*hurricane.|.*surf.*|.*surge.*", "tropical_storm", .)
# winter_storm
datf$EVTYPE %<>% gsub(".*blizzard.*|.*winter.*|.*snow.*|.*wintry.*|.*freezing.*|.*ice.*|.*sleet.*", "winter_storm", .)
# fog
datf$EVTYPE %<>% gsub(".*fog.*", "fog", .)
# heat
datf$EVTYPE %<>% gsub(".*heat.*|.*warmth.*", "heat", .)
# cold
datf$EVTYPE %<>% gsub(".*cold.*|.*icy.*|.*frezze.*|.*frost.*", "cold", .)
# dust
datf$EVTYPE %<>% gsub(".*dust.*", "dust", .)
# rip_current
datf$EVTYPE %<>% gsub(".*rip.*", "rip_current", .)

# now check how much data the top 15 tags cover
table(datf$EVTYPE) %>% sort(decreasing = T) %>% .[1:15] %>% sum() / nrow(datf)

## [1] 0.9948664

Well 99.5% coverage is good enough for me and likely I could have gotten away with a less gsub() calls, but it’s clean enough so lets keep moving.

The dollar damage data is stored separately for properties and crops. Each is seperated into two columns one for the “base” value(datf$PROPDMG) and one for the “multiplier” (PROPDMGEXP). We need to clean datf$PROPDMGEXP and datf$CROPDMGEXP so that all values are numeric and not 0. So that means decoding letters like “K”, “M” and “B” into their numeric values 1000, 1000000, and 1000000000. I like using named vectors for decoding scenarios like this.

datf$PROPDMGEXP %<>% as.character() %>% toupper()
exp_decode <- c(1000, 1000000, 1000000000)
names(exp_decode) <- c("K", "M", "B")
datf %<>% mutate(PROPDMGEXP = ifelse(PROPDMGEXP %in% names(exp_decode),
                                     exp_decode[PROPDMGEXP],
                                     PROPDMGEXP))
datf$PROPDMGEXP %<>% as.numeric()
# now we can't have any 0's in our multipliers
datf %<>% mutate(PROPDMGEXP = ifelse(is.na(PROPDMGEXP) | PROPDMGEXP == 0,
                                     1,
                                     PROPDMGEXP))
# combine the multipliers into prop_damage (one large value)
datf %<>% mutate(prop_damage = PROPDMG*PROPDMGEXP)

# repeat for CROPDMGEXP
# fix crop damage
datf$CROPDMGEXP %<>% as.character() %>% toupper()
exp_decode <- c(1000, 1000000, 1000000000)
names(exp_decode) <- c("K", "M", "B")
datf %<>% mutate(CROPDMGEXP = ifelse(CROPDMGEXP %in% names(exp_decode),
                                     exp_decode[CROPDMGEXP],
                                     CROPDMGEXP))
datf$CROPDMGEXP %<>% as.numeric()
datf %<>% mutate(CROPDMGEXP = ifelse(is.na(CROPDMGEXP) | CROPDMGEXP == 0,
                                     1,
                                     CROPDMGEXP))
# combine into crop_damage and total_damage
datf %<>% mutate(crop_damage = CROPDMG*CROPDMGEXP,
                 total_damage = crop_damage + prop_damage)

The warnings are okay because we handle them with the ifelse substitution immediately afterwards. Now that we have a nice representation of the fiscal damage caused by weather events with: prop_damage, crop_damage and total_damage, we can forge ahead to working on datf$FATALITIES ans datf$INJURIES.

These don’t require the same ammount of innervention as above. All I wanted to do was build datf$casualties, to summarise the total number of injuries (including death) for an given record.

datf %<>% mutate(casualties = INJURIES + FATALITIES)

Result Tables

With our data cleaned up, lets summarise our data by EVTYPE and calculate the total financial and total health impact of each event type.

table_sum <- datf %>% group_by(EVTYPE) %>% 
    summarise_at(vars(casualties, FATALITIES, INJURIES, prop_damage, crop_damage, total_damage),
                 sum)

## Warning: Grouping rowwise data frame strips rowwise nature

That warning is okay and it’s just letting us know that our grouping is alterning R’s normal rowwise behavior, we will see the same warning later with group_by calls.

Now that we have our summary data frame codensed by EVTYPE lets do two quick sorts on our table_sum, to see the most damaging events for monetary loss and health loss. Sorry, I’m using tables to save my figures for some awesome map visualizations :)

# largest health impact
table_sum %>% arrange(-casualties)

## # A tibble: 15 × 7
##            EVTYPE casualties FATALITIES INJURIES  prop_damage crop_damage
##             <chr>      <dbl>      <dbl>    <dbl>        <dbl>       <dbl>
## 1         tornado      96979       5633    91346  56936689188   414952270
## 2            wind      12665       1383    11282  16059457309  1978785138
## 3            heat      12024       3116     8908     20325750   904469280
## 4           flood      10173       1503     8670 167185715900 12326185200
## 5    winter_storm       6720        651     6069  12369740320  5311281400
## 6       lightning       6018        806     5212    927616016    12092090
## 7            fire       1697         90     1607   8494501500   402116630
## 8            hail       1510         45     1465  17619895192  3114210873
## 9             fog       1158         81     1077     22829500           0
## 10 tropical_storm        862        216      646  55554565550   592050000
## 11           cold        529        216      313    135727650  2611301550
## 12     heavy rain        297         66      231    690605060   733289800
## 13     waterspout         31          3       28      4120000           0
## 14        drought          4          0        4   1041106000 13972361000
## 15   funnel cloud          3          0        3       194600           0
## # ... with 1 more variables: total_damage <dbl>

We can see that tornados historically have been the most dangerous type of weather event in terms of casualties, by far. This make sense because tornados can appear quickly and do tremendos damage in a short ammount of time, leaving people with little to no time to prepare or evacuate. Other adverse weather events often come with forcasts and predictions ranging from days to weeks, allowing populations and localities to make the necessary preparations to minimize the loss of life.

# largest dollar impact
table_sum %>% arrange(-total_damage) %>% select(EVTYPE, total_damage, everything())

## # A tibble: 15 × 7
##            EVTYPE total_damage casualties FATALITIES INJURIES  prop_damage
##             <chr>        <dbl>      <dbl>      <dbl>    <dbl>        <dbl>
## 1           flood 179511901100      10173       1503     8670 167185715900
## 2         tornado  57351641458      96979       5633    91346  56936689188
## 3  tropical_storm  56146615550        862        216      646  55554565550
## 4            hail  20734106065       1510         45     1465  17619895192
## 5            wind  18038242447      12665       1383    11282  16059457309
## 6    winter_storm  17681021720       6720        651     6069  12369740320
## 7         drought  15013467000          4          0        4   1041106000
## 8            fire   8896618130       1697         90     1607   8494501500
## 9            cold   2747029200        529        216      313    135727650
## 10     heavy rain   1423894860        297         66      231    690605060
## 11      lightning    939708106       6018        806     5212    927616016
## 12           heat    924795030      12024       3116     8908     20325750
## 13            fog     22829500       1158         81     1077     22829500
## 14     waterspout      4120000         31          3       28      4120000
## 15   funnel cloud       194600          3          0        3       194600
## # ... with 1 more variables: crop_damage <dbl>

In terms of financial impact we see that floods displace tornados as the most costly type of weather. There are lots of ways floods can be generated (tropical stroms, thunderstorms, etc), and while people can evacuate in flood scenarios, structures and crops can not. I’m not suprised by this, because while floods are infrequent they are capable damaging entire rivershead areas, stretching across mutliple states.

Result Maps

Now for the “cool” visualizations. I wanted to explore the both the financial and health impact across the US in terms of regions. This is interesting because it can help identify the areas that are more likely to get “hit” and hopefully those states could allocate more budget to disaster training, infrastructure improvments and rescue teams.

Here is the code chuck required to get our map plots ready:

datf %<>% filter(STATE %in% state.abb)
datf %<>% rowwise() %>% mutate(stateName = tolower(state.name[grep(STATE, state.abb)]))

state_summary <- datf %>% group_by(stateName) %>%
    summarise_at(vars(casualties, FATALITIES, INJURIES, prop_damage, crop_damage, total_damage),
                 sum)

## Warning: Grouping rowwise data frame strips rowwise nature

Our large weather data frame has many localities beyond the 50 US states includes (in low frequencies), but since our map call only recognizes states, we needed to pair down. Also we are using the two letter state abbreviations to key between the map raster object and our data, so we took advantage of the built in state.abb to convert from datf$STATE.

Now for the magic with ggplot, let’s look at causualties first:

library(maps) # this where we get the map raster objects with functions like `map_data()`
states_map <- map_data("state") # only need to read this in once, but it's used in all 3 maps

#casualties
ggplot(state_summary, aes(map_id = stateName, fill = casualties)) + 
    geom_map(map = states_map, color = "black", na.rm = T) +
    scale_fill_gradient(name = "Casualties", low = "white", high = "firebrick") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    coord_map("polyconic") +
    labs(title = "Total casualties by U.S. State since 1950",
         subtitle = "Fatalies + Injuries",
         caption = "source: NWSI 10-1605 AUGUST 17, 2007") +
    theme_map()

## Loading required package: grid

We can see that Texas is hit the hardest of any states. This makes sensse with their location, being close to the Gulf of Mexico and it’s hurricanes and thunderstorms, which also being at the southern end of “Tornado Alley” with is frequent twisters and high winds. After Texas it looks like the Gulf states (AL, MI) and those along the Missisippi River (AR, IL, MO) also get hit more often. The NorthWest and NorthEast of the country are almost white, so they rarely have casualties as a result of weather.

Next we can do the same style of plot, this time filling by total dollar damage.

#total_damage
ggplot(state_summary, aes(map_id = stateName, fill = total_damage/1000000)) + 
    geom_map(map = states_map, color = "black", na.rm = T) +
    scale_fill_gradient(name = "millions $USD", low = "white", high = "steelblue") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    coord_map("polyconic") +
    labs(title = "Total Damage by U.S. State since 1950",
         subtitle = "Property + Crop",
         caption = "source: NWSI 10-1605 AUGUST 17, 2007") +
    theme_map()

## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

We can see that California takes more damage than any other state. This makes sense if you think about how large California’s economy is. Fortune magazine reportes on June 17, 2016 that California passed France as the 6th largest economy in the world ! (http://fortune.com/2016/06/17/california-france-6th-largest-economy/). Thats a lot of money and material out in CA so it is no suprise that when bad weather strikes the repair bill is high. After CA, we see a patch of blue surrounding the Gulf of Mexico and this is likely due to tropical stroms hitting that region frequently and ferociously.

Finally I wanted to look that the most frequent and the most damaging adverse weather event for each state.

# summarise events by count
state_summary2 <- datf %>% group_by(stateName) %>%
    count(EVTYPE) %>% filter(n == max(n))

# most frequent EVTYPE
ggplot(state_summary2, aes(map_id = stateName, fill = EVTYPE)) + 
    geom_map(map = states_map, color = "black", na.rm = T) +
    scale_fill_brewer(name = "Event Type", palette = "Dark2") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    coord_map("polyconic") +
    labs(title = "Most Frequent Disaster Event by U.S. State",
         subtitle = "Measure as highest count since 1950",
         caption = "source: NWSI 10-1605 AUGUST 17, 2007") +
    theme_map()

Wind is the most common weather source for damage across the US and I am sure everyone knows someone that has had a tree get knocked down by wind. Wind is often responsible for power outages as falling trees destroy power lines. Hail is very common in the central corridor of the country, which was a little suprising for me, I thought it would be tornados. But what do I know, I live on the East Coast. Interesting outliers are Utah with a lot of winter storms and Hawaii with a lot of heavy rain. Alaska has wind as the most frequent bad weather event.

# most damageing weather event
state_summary3 <- datf %>% group_by(stateName, EVTYPE) %>%
    summarise(damage = sum(total_damage), casualties = sum(casualties)) %>%
    ungroup() %>% group_by(stateName) %>% arrange(-damage) %>% 
    slice(1)

ggplot(state_summary3, aes(map_id = stateName, fill = EVTYPE)) + 
    geom_map(map = states_map, color = "black", na.rm = T) +
    scale_fill_brewer(name = "Event Type", palette = "Dark2") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    coord_map("polyconic") +
    labs(title = "Most Damaging Weather Event by US State",
         subtitle = "Measure as Total $ Damage since 1950",
         caption = "source: NWSI 10-1605 AUGUST 17, 2007") +
    theme_map()

When we turn too look at the most damaging event by US state we see a lot of difference based on region. Maine gets the worst of it weather from winter storms, not suprising, and the Gulf states take their biggest hits from tropical storms. Tornodoes and flooding are common accross the country and that make sense seeing how destrtuctive those two types of events were in our tables above. Interesting to see New Mexico takes the most damage from fires. Alaska and Hawaii both have floods as their most damaging type of weather.

Thanks for reading my report. I know I included 4 figures, please be merciful, I just thought the data deserved those nice map representations!

I’ve included the code for theme_map() in the Appendix. If you have any questions or comments, my email is nathancday@gmail.com. Happy to code and happy to help!

Appendix

Here is the code associated with theme_map(), which was graciously written and shared on github by François Briatte (https://gist.github.com/briatte/4718656). God bless the Open Source!

theme_osa = function(base_size=9, base_family="")
{
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.title.x=element_text(vjust=0),
              axis.title.y=element_text(angle=90, vjust=0.3),
              axis.text=element_text(),
              axis.ticks=element_line(colour="black", size=0.25),
              legend.background=element_rect(fill=NA, colour=NA),
              legend.direction="vertical",
              legend.key=element_rect(fill=NA, colour="white"),
              legend.text=element_text(),
              legend.title=element_text(face="bold", hjust=0),
              panel.border=element_rect(fill=NA, colour="black"),
              panel.grid.major=element_line(colour="grey92", size=0.3, linetype=1),
              panel.grid.minor=element_blank(),
              plot.title=element_text(vjust=1),
              strip.background=element_rect(fill="grey90", colour="black", size=0.3),
              strip.text=element_text()
        )
}
theme_map = function(base_size=9, base_family="")
{
    require(grid)
    theme_osa(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text.x=element_blank(),
              axis.text.y=element_blank(),
              axis.ticks=element_blank(),
              axis.title.x=element_blank(),
              axis.title.y=element_blank(),
              legend.justification = c(0,0), # bottom of box
              legend.position      = c(0,0), # bottom of picture
              panel.background=element_blank(),
              panel.border=element_rect(colour = "grey90", size = 1, fill = NA),
              panel.grid.major=element_blank(),
              panel.grid.minor=element_blank(),
              panel.margin=unit(0, "lines"),
              plot.background=element_blank()
        )
}