Synopsis

The exploratory analysis below is performed on the dataset of NOAA Storm Database. The dataset, after an extensive cleaning, is used for three analyses. First, I take an overview of total impact on population health and economic damage by all storm-related events over time. Second, I identify the events most harmful for population health in United States. Finally, I identify the events causing the biggest economic damage across United States. I briefly discuss each figure.

Data processing

Configuration

Here is the setup for data operations and results, including the libraries:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(reshape2)
library(htmlTable)
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] htmlTable_1.9   reshape2_1.4.2  lubridate_1.6.0 ggplot2_2.2.1  
## [5] dplyr_0.7.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11     knitr_1.16       bindr_0.1        magrittr_1.5    
##  [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.2         rlang_0.1.1     
##  [9] plyr_1.8.4       stringr_1.2.0    tools_3.3.2      grid_3.3.2      
## [13] checkmate_1.8.3  gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.0  
## [17] yaml_2.1.14      assertthat_0.2.0 rprojroot_1.2    digest_0.6.12   
## [21] tibble_1.3.3     bindrcpp_0.2     htmlwidgets_0.8  glue_1.1.1      
## [25] evaluate_0.10    rmarkdown_1.6    stringi_1.1.5    scales_0.4.1    
## [29] backports_1.1.0  pkgconfig_2.0.1

Here I’m downloading the storm data file, if it doesn’t exist.

if(!file.exists("Storm Data.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "Storm Data.csv.bz2")
}

Loading

Here I’m creating the table from the bz2 storm data file and store it in the cache file, if it doesn’t exist. Otherwise I load it from the cache. Because of the read.csv() capabilities, I can read the data file directly from the .bz2 format.

if(!file.exists("Storm Data.Rd")) {
  storm.data <- tbl_df(read.csv("Storm Data.csv.bz2"))
  save(storm.data, file = "Storm Data.Rd")
} else {
  load("Storm Data.Rd")
}

Cleaning

Here I perform a small date and time cleaning for BGN/END DATE by changing them to a Date tyoe with the help of lubridate. A small helper function makes it easier to define a clean pipeline. I remove time and time zone information, it is redundant for this analysis.

format_NOAA_date <- function(dates) { 
  new.dates <- sapply(as.character(dates), 
                      function(x) strsplit(x, split = " ")[[1]][1])
  return(mdy(new.dates))
}
storm.data <- mutate(storm.data, 
                      BGN_DATE = format_NOAA_date(BGN_DATE),
                      END_DATE = format_NOAA_date(END_DATE)) %>%
                      select(-BGN_TIME,-END_TIME,-TIME_ZONE)

Here’s the cleaning of PROPDMG and CROPDMG variables, as the number is actually split into two columns, with PROP/CROP_EXP columns containing the notation for hundreds (H), thousands (K), millions (M), billions (B), and explicing numeric encodings of the exponents. I remove the CROP/PROP_EXP columns as redundant afterwards.

format_NOAA_dmg <- function(dmgval,dmgexp) {
  new.dmgval <- dmgval
  this.dmgexp <- toupper(as.character(dmgexp))
  this.dmgexp[this.dmgexp == "H"] <- "2"
  this.dmgexp[this.dmgexp == "K"] <- "3"
  this.dmgexp[this.dmgexp == "M"] <- "6"
  this.dmgexp[this.dmgexp == "B"] <- "9"
  this.dmgexp[this.dmgexp %in% c("+","-","","?")] <- "0"
  new.dmgval <- dmgval*10^as.numeric(this.dmgexp)
  return(new.dmgval)
}
storm.data <- mutate(storm.data, 
                      PROPDMG = format_NOAA_dmg(PROPDMG,PROPDMGEXP),
                      CROPDMG = format_NOAA_dmg(CROPDMG,CROPDMGEXP)) %>%
                      select(-PROPDMGEXP,-CROPDMGEXP)

I need to clean EVTYPE codes, as they are quite a mess. This involves: 1) correction/removal of special characters (e.g. starting or trailing spaces) and converting all terms to uppercase 2) unification of terms (e.g. TSTM equals THUNDERSTORM); I use a defined term_mapping table 3) cleaning related types by grouping, as there are many versions of the same type (e.g. THUNDERSTORM, THUNDERSTORM WINDS, THUNDERSTORMS); the grouing is informed by the documentation for the dataset. I use a defined term_grouping table.

term_mapping <- rbind(
  c("TSTM","THUNDERSTORM"),
  c("THUNDESTORM|THUNERSTORM|THUNDERTORM|THUNDERTSORM|THUNDERESTORM|THUNDERSTROM","THUNDERSTORM"),
  c("WND","WIND"),
  c("WINDSS","WIND"),
  c("WINDS","WIND"),
  c("WINS","WIND"),
  c("CSTL", "COASTAL"),
  c("AVALANCE", "AVALANCHE"),
  c("HVY", "HEAVY"),
  c("EROSIN", "EROSION"),
  c("FLD|FLOODG|FLOODING", "FLOOD"),
  c("PRECIPITATION|PRECIPATATION|SHOWERS|SHOWER", "RAIN")
)

term_grouping <- rbind(
  c("^THUNDERSTORM|THUNDEERSTORM|THUNDERSNOW|MICROBURST|SEVERE THUNDERSTORM","THUNDERSTORM WIND"),
  c("^HURRICANE|TYPHOON","HURRICANE"),
  c("TORNADO","TORNADO"),
  c("^RIP CURRENT","RIP CURRENT"),
  c("^SNOW","HEAVY SNOW"),
  c("^WIND$","STRONG WIND"),
  c("^ICE|^FREEZING RAIN|BLACK ICE|FROST|FREEZE|DAMAGING FREEZE|COOL AND WET","FROST/FREEZE"),
  c("^GLAZE","FREEZING FOG"),
  c("FOREST FIRE|WILDFIRES|WILD FIRES","WILDFIRE"),
  c("COLD|EXTREME WINDCHILL|LOW TEMPERATURE","COLD/WIND CHILL"),
  c("WINTRY MIX|WINTER WEATHER MIX|WINTER WEATHER/MIX|ICY ROADS|FREEZING DRIZZLE|LIGHT SNOW","WINTER WEATHER"),
  c("^WINTER STORM","WINTER STORM"),
  c("^HEAVY SURF|ROUGH SURF","HIGH SURF"),
  c("RECORD HEAT|HEAT WAVE|EXTREME HEAT|RECORD/EXCESSIVE HEAT|^EXCESSIVE HEAT|^UNSEASONABLY WARM","HEAT"),
  c("LANDSLIDE|MUDSLIDE|MUD SLIDE","DEBRIS FLOW"),
  c("^HIGH WIND","HIGH WIND"),
  c("STREAM FLOOD|FLASH FLOO|^URBAN|FLOOD?FLASH|RIVER FLOOD","FLASH FLOOD"),
  c("^FLOOD|LAKESHORE FLOOD|MAJOR FLOOD","FLOOD"),
  c("^BLIZZARD","BLIZZARD"),
  c("HAIL","HAIL"),
  c("DENSE FOG|VOG","FOG"),
  c("DRYNESS|DRY |DROUGHT|EXCESSIVELY DRY","DROUGHT"),
  c("^HEAVY RAIN|EXCESSIVE RAINFALL|MIXED PRECIP|EXCESSIVE WETNESS|UNSEASONAL RAIN|^RAIN$","HEAVY RAIN"),
  c("^HEAVY SNOW|HEAVY WET SNOW|EXCESSIVE SNOW|RECORD SNOW","HEAVY SNOW"),
  c("^TROPICAL STORM","TROPICAL STORM"),
  c("^STORM SURGE$|HIGH SEAS|ROUGH SEAS","STORM SURGE/TIDE"),
  c("COASTAL FLOOD/EROSION|COASTAL STORM|EROSION/COASTAL FLOOD","COASTAL FLOOD"),
  c("LIGHTNING FIRE","LIGHTNING"),
  c("^HYPOTHERMIA/EXPOSURE$|NON-SEVERE WIND DAMAGE|MARINE MISHAP|GUSTY WIND|RAIN/SNOW|HEAVY MIX","OTHER")
)

to_clean <- toupper(as.character(storm.data$EVTYPE))
to_clean <- gsub("\\.$|^ {1,3}|[^A-Z0-9]{1,3}$","",to_clean)
to_clean <- gsub(" {2,20}"," ",to_clean)

for(i in 1:nrow(term_mapping)) { to_clean <- gsub(term_mapping[i,1],term_mapping[i,2],to_clean) }

for(i in 1:nrow(term_grouping)) { to_clean[grep(term_grouping[i,1], to_clean)] <- term_grouping[i,2] }

storm.data <- mutate(storm.data, EVTYPE_CLEAN = as.factor(to_clean))

After loading and cleaning, I trim the dataset by selected columns and I check the distribution of BGN_DATE. It looks like only a portion of the dataset goes back to 1950. I would like to operate on a time span that has data points for more event types. Based on the 1st quartile of the dataset I trim the dates to 1994.

storm.data.tidy <- select(storm.data, STATE, BGN_DATE, 
                                          EVTYPE_CLEAN, FATALITIES, INJURIES, 
                                          PROPDMG, CROPDMG,
                                          LATITUDE,LONGITUDE)
storm.data.tidy %>% 
  group_by(EVTYPE_CLEAN) %>% 
  summarize(minyear = min(year(BGN_DATE))) %>% 
  select(minyear) %>% 
  summary()
##     minyear    
##  Min.   :1950  
##  1st Qu.:1994  
##  Median :1996  
##  Mean   :1996  
##  3rd Qu.:1997  
##  Max.   :2007
storm.data.tidy <- filter(storm.data.tidy, year(BGN_DATE) > 1993)

After trimming, the tidy dataset looks as follows:

str(storm.data.tidy)
## Classes 'tbl_df', 'tbl' and 'data.frame':    702131 obs. of  9 variables:
##  $ STATE       : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ BGN_DATE    : Date, format: "1995-01-06" "1995-01-22" ...
##  $ EVTYPE_CLEAN: Factor w/ 330 levels "","ABNORMAL WARMTH",..: 67 86 67 86 86 103 278 278 280 79 ...
##  $ FATALITIES  : num  0 0 0 0 0 2 0 0 0 0 ...
##  $ INJURIES    : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ PROPDMG     : num  0e+00 0e+00 0e+00 0e+00 0e+00 1e+08 5e+04 5e+06 5e+05 0e+00 ...
##  $ CROPDMG     : num  0e+00 0e+00 0e+00 0e+00 0e+00 1e+07 0e+00 5e+05 0e+00 0e+00 ...
##  $ LATITUDE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LONGITUDE   : num  0 0 0 0 0 0 0 0 0 0 ...

Results

Overview of the total storm damage in time

First, I’m taking a look at the total population health and economic damage from storms in United States over considered time span (1994-2011).

health_sum <- storm.data.tidy %>% 
  group_by(year = year(BGN_DATE)) %>% 
  summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))
dmg_sum <- storm.data.tidy %>% 
  group_by(year = year(BGN_DATE)) %>% 
  summarize(PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG))
flat_table <- rbind(
  data.frame(year = health_sum$year, numbers = health_sum$FATALITIES, 
             type = "Fatalities", class = "Population health"),
  data.frame(year = health_sum$year, numbers = health_sum$INJURIES, 
             type = "Injuries", class = "Population health"),
  data.frame(year = dmg_sum$year, numbers = dmg_sum$PROPDMG/1000000, 
             type = "Property dmg", class = "Economic dmg (Mln $)"),
  data.frame(year = dmg_sum$year, numbers = dmg_sum$CROPDMG/1000000, 
             type = "Crop dmg", class = "Economic dmg (Mln $)")
)

ggplot(data = flat_table, aes(x = year, y = numbers, color = type)) + geom_point() + geom_line() + facet_grid(class ~ .,scales = "free") + ggtitle("Total health and economic damage in United States from storms")

From the figure above we can see that there is little correlation between economic and health damage. An Outstanding population health damage took place in years 1995, 1998, 2006 and 2011. An outstanding economic damage took place in years 2005 and 2006.

Events harmful for population health

Let’s summarize the events by their impact on the population health. I’m taking a look at the first 30 entries sorted descending according to fatalities and injuries.

health_sum_type <- storm.data.tidy %>% 
  group_by(Event = EVTYPE_CLEAN) %>% 
  summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
  arrange(desc(Fatalities,Injuries))
htmlTable(health_sum_type[1:30,], 
          align = c("r","c","c"), 
          css.cell = "padding-left: 1.5em; padding-right: 1.5em;")
Event Fatalities Injuries
1 HEAT 3169 9236
2 TORNADO 1593 22593
3 FLASH FLOOD 1039 1852
4 LIGHTNING 795 5116
5 RIP CURRENT 577 529
6 COLD/WIND CHILL 456 320
7 FLOOD 455 6779
8 THUNDERSTORM WIND 439 6017
9 HIGH WIND 256 1259
10 AVALANCHE 224 170
11 WINTER STORM 196 1313
12 HIGH SURF 160 245
13 HEAVY SNOW 138 1083
14 HURRICANE 135 1332
15 STRONG WIND 131 386
16 HEAVY RAIN 102 288
17 FROST/FREEZE 98 2074
18 WILDFIRE 87 1456
19 FOG 75 1063
20 BLIZZARD 71 389
21 WINTER WEATHER 70 663
22 TROPICAL STORM 66 383
23 DEBRIS FLOW 44 55
24 TSUNAMI 33 129
25 STORM SURGE/TIDE 25 54
26 DUST STORM 22 439
27 OTHER 22 30
28 MARINE THUNDERSTORM WIND 19 34
29 MARINE STRONG WIND 14 22
30 HAIL 10 953

As we can see, heat caused the most fatalities, while tornadoes are responsible for the most injuries in the given time period.

I will plot only 20 most severe events, and their impact across United States. I will use the coordinates of the geographical middle of the US (42.87 Lat, -97.38 Long) to split data into “quarters”.

health_sum_loc <- storm.data.tidy %>%
  filter(as.character(EVTYPE_CLEAN) %in% as.character(health_sum_type[1:20,]$Event)) %>%
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
  mutate(Sh_Lat = ifelse(LATITUDE > 4287,"North","South"), 
         Sh_Long = ifelse(LONGITUDE > 9738,"West","East")) %>%
    mutate(Sh_Lat = factor(Sh_Lat, levels = c("North","South")), 
         Sh_Long = factor(Sh_Long,levels = c("West","East"))) %>%
  group_by(Event = EVTYPE_CLEAN, Sh_Lat, Sh_Long) %>%
  summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES))

ggplot(data = melt(health_sum_loc), aes(x = reorder(Event,value), y = value, fill = variable)) + geom_bar(position = position_dodge(), stat="identity") + facet_grid(Sh_Lat ~ Sh_Long) + coord_flip() + scale_y_continuous(name = "Number of accidents") + scale_x_discrete(name = "Names of events") + ggtitle(label = "Number of accicents affecting population health for the storm events with most impact, across United States")
## Using Event, Sh_Lat, Sh_Long as id variables

From the figure above we can see that the South East of the United States is the most affected area, although tornadoes are affecting also the South West and Nort East. The most harmful events are tornadoes, heat, thunderstorm winds and floods.

Events causing property damage

Let’s summarize the events by their impact on the economic damage. I’m taking a look at the first 30 entries sorted descending according to summarized property and crop damage. I scale the costs to millions of USD.

property_sum_type <- storm.data.tidy %>% 
  group_by(Event = EVTYPE_CLEAN) %>% 
  summarize(Property.dmg = sum(PROPDMG)/1000000, 
            Crop.dmg = sum(CROPDMG)/1000000) %>%
  arrange(desc(Property.dmg+Crop.dmg)) %>%
  mutate(Property.dmg = formatC(Property.dmg, format="f", digits=2, big.mark=","),
         Crop.dmg = formatC(Crop.dmg, format="f", digits=2, big.mark=","))
htmlTable(property_sum_type[1:30,],
          header = c("Event", "Property dmg, mln USD", "Crop dmg, mln USD"),
          align = c("r","c","c"), 
          css.cell = "padding-left: 5.5em; padding-right: 5.5em;")
Event Property dmg, mln USD Crop dmg, mln USD
1 FLOOD 144,306.88 5,628.49
2 HURRICANE 85,300.91 5,515.62
3 STORM SURGE/TIDE 47,834.74 0.85
4 TORNADO 25,636.48 361.83
5 FLASH FLOOD 17,404.43 1,574.88
6 HAIL 15,580.34 3,003.63
7 DROUGHT 1,046.11 13,922.07
8 THUNDERSTORM WIND 10,746.66 1,263.35
9 FROST/FREEZE 3,869.21 6,913.17
10 TROPICAL STORM 7,713.89 694.39
11 WILDFIRE 7,877.56 403.28
12 HIGH WIND 5,416.09 668.30
13 HEAVY RAIN 3,224.99 943.51
14 WINTER STORM 1,620.10 16.94
15 COLD/WIND CHILL 129.73 1,426.77
16 HEAT 20.33 904.47
17 HEAVY SNOW 774.60 129.18
18 LIGHTNING 878.44 11.99
19 BLIZZARD 527.01 7.06
20 COASTAL FLOOD 438.05 0.06
21 DEBRIS FLOW 326.53 20.02
22 STRONG WIND 180.68 70.75
23 TSUNAMI 144.06 0.02
24 HIGH SURF 100.83 0.00
25 WINTER WEATHER 30.39 15.00
26 LAKE-EFFECT SNOW 40.12 0.00
27 FOG 22.22 0.00
28 ASTRONOMICAL HIGH TIDE 9.43 0.00
29 DUST STORM 5.54 3.10
30 WATERSPOUT 7.30 0.00

As we can see, floods and hurricanes are the most damaging events economically.

Again, I will plot only 20 most severe events, and their impact across United States. I will use the coordinates of the geographical middle of the US (42.87 Lat, -97.38 Long) to split data into “quarters”.

property_sum_loc <- storm.data.tidy %>%
  filter(as.character(EVTYPE_CLEAN) %in% as.character(property_sum_type[1:20,]$Event)) %>%
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
  mutate(Sh_Lat = ifelse(LATITUDE > 4287,"North","South"), 
         Sh_Long = ifelse(LONGITUDE > 9738,"West","East")) %>%
    mutate(Sh_Lat = factor(Sh_Lat, levels = c("North","South")), 
         Sh_Long = factor(Sh_Long,levels = c("West","East"))) %>%
  group_by(Event = EVTYPE_CLEAN, Sh_Lat, Sh_Long) %>%
  summarize(Property.dmg = sum(PROPDMG)/1000000, Crop.dmg = sum(CROPDMG)/1000000)

ggplot(data = melt(property_sum_loc), aes(x = reorder(Event,value), y = value, fill = variable)) + geom_bar(position = position_dodge(), stat="identity") + facet_grid(Sh_Lat ~ Sh_Long) + coord_flip() + scale_y_continuous(name = "Costs of damage in millions $") + scale_x_discrete(name = "Names of events") + ggtitle(label = "Costs of economic damage for the storm events with most impact, across United States")
## Using Event, Sh_Lat, Sh_Long as id variables

From the figure above we can see that the South East and South West of the United States are the most affected area. Flood in the South West generated the highest property costs across the United States in the analyzed period. In the South East hurricanes and storm surge/tide events generated the biggest economic burden. Interestingly, these events do not correlate well with the events affecting population health (tornadoes, heat), altohugh the top ten lists overlap.