Week4 Assignment: Exploratory analysis of NOAA Storm Database. Impact on population health and property damage.

Synopsis

The exploratory analysis below is performed on the dataset of NOAA Storm Database. The dataset, after an extensive cleaning, is used for three analyses. First, I take an overview of total impact on population health and economic damage by all storm-related events over time. Second, I identify the events most harmful for population health in United States. Finally, I identify the events causing the biggest economic damage across United States. I briefly discuss each figure.

Data processing

Configuration

Here is the setup for data operations and results, including the libraries:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(reshape2)
library(htmlTable)
sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] htmlTable_1.9   reshape2_1.4.2  lubridate_1.6.0 ggplot2_2.2.1  
## [5] dplyr_0.7.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11     knitr_1.16       bindr_0.1        magrittr_1.5    
##  [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.2         rlang_0.1.1     
##  [9] plyr_1.8.4       stringr_1.2.0    tools_3.3.2      grid_3.3.2      
## [13] checkmate_1.8.3  gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.0  
## [17] yaml_2.1.14      assertthat_0.2.0 rprojroot_1.2    digest_0.6.12   
## [21] tibble_1.3.3     bindrcpp_0.2     htmlwidgets_0.8  glue_1.1.1      
## [25] evaluate_0.10    rmarkdown_1.6    stringi_1.1.5    scales_0.4.1    
## [29] backports_1.1.0  pkgconfig_2.0.1

Here I’m downloading the storm data file, if it doesn’t exist.

if(!file.exists("Storm Data.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "Storm Data.csv.bz2")
}

Loading

Here I’m creating the table from the bz2 storm data file and store it in the cache file, if it doesn’t exist. Otherwise I load it from the cache. Because of the read.csv() capabilities, I can read the data file directly from the .bz2 format.

if(!file.exists("Storm Data.Rd")) {
  storm.data <- tbl_df(read.csv("Storm Data.csv.bz2"))
  save(storm.data, file = "Storm Data.Rd")
} else {
  load("Storm Data.Rd")
}

Cleaning

Here I perform a small date and time cleaning for BGN/END DATE by changing them to a Date tyoe with the help of lubridate. A small helper function makes it easier to define a clean pipeline. I remove time and time zone information, it is redundant for this analysis.

format_NOAA_date <- function(dates) { 
  new.dates <- sapply(as.character(dates), 
                      function(x) strsplit(x, split = " ")[[1]][1])
  return(mdy(new.dates))
}
storm.data <- mutate(storm.data, 
                      BGN_DATE = format_NOAA_date(BGN_DATE),
                      END_DATE = format_NOAA_date(END_DATE)) %>%
                      select(-BGN_TIME,-END_TIME,-TIME_ZONE)

Here’s the cleaning of PROPDMG and CROPDMG variables, as the number is actually split into two columns, with PROP/CROP_EXP columns containing the notation for hundreds (H), thousands (K), millions (M), billions (B), and explicing numeric encodings of the exponents. I remove the CROP/PROP_EXP columns as redundant afterwards.

format_NOAA_dmg <- function(dmgval,dmgexp) {
  new.dmgval <- dmgval
  this.dmgexp <- toupper(as.character(dmgexp))
  this.dmgexp[this.dmgexp == "H"] <- "2"
  this.dmgexp[this.dmgexp == "K"] <- "3"
  this.dmgexp[this.dmgexp == "M"] <- "6"
  this.dmgexp[this.dmgexp == "B"] <- "9"
  this.dmgexp[this.dmgexp %in% c("+","-","","?")] <- "0"
  new.dmgval <- dmgval*10^as.numeric(this.dmgexp)
  return(new.dmgval)
}
storm.data <- mutate(storm.data, 
                      PROPDMG = format_NOAA_dmg(PROPDMG,PROPDMGEXP),
                      CROPDMG = format_NOAA_dmg(CROPDMG,CROPDMGEXP)) %>%
                      select(-PROPDMGEXP,-CROPDMGEXP)

I need to clean EVTYPE codes, as they are quite a mess. This involves: 1) correction/removal of special characters (e.g. starting or trailing spaces) and converting all terms to uppercase 2) unification of terms (e.g. TSTM equals THUNDERSTORM); I use a defined term_mapping table 3) cleaning related types by grouping, as there are many versions of the same type (e.g. THUNDERSTORM, THUNDERSTORM WINDS, THUNDERSTORMS); the grouing is informed by the documentation for the dataset. I use a defined term_grouping table.

term_mapping <- rbind(
  c("TSTM","THUNDERSTORM"),
  c("THUNDESTORM|THUNERSTORM|THUNDERTORM|THUNDERTSORM|THUNDERESTORM|THUNDERSTROM","THUNDERSTORM"),
  c("WND","WIND"),
  c("WINDSS","WIND"),
  c("WINDS","WIND"),
  c("WINS","WIND"),
  c("CSTL", "COASTAL"),
  c("AVALANCE", "AVALANCHE"),
  c("HVY", "HEAVY"),
  c("EROSIN", "EROSION"),
  c("FLD|FLOODG|FLOODING", "FLOOD"),
  c("PRECIPITATION|PRECIPATATION|SHOWERS|SHOWER", "RAIN")
)

term_grouping <- rbind(
  c("^THUNDERSTORM|THUNDEERSTORM|THUNDERSNOW|MICROBURST|SEVERE THUNDERSTORM","THUNDERSTORM WIND"),
  c("^HURRICANE|TYPHOON","HURRICANE"),
  c("TORNADO","TORNADO"),
  c("^RIP CURRENT","RIP CURRENT"),
  c("^SNOW","HEAVY SNOW"),
  c("^WIND$","STRONG WIND"),
  c("^ICE|^FREEZING RAIN|BLACK ICE|FROST|FREEZE|DAMAGING FREEZE|COOL AND WET","FROST/FREEZE"),
  c("^GLAZE","FREEZING FOG"),
  c("FOREST FIRE|WILDFIRES|WILD FIRES","WILDFIRE"),
  c("COLD|EXTREME WINDCHILL|LOW TEMPERATURE","COLD/WIND CHILL"),
  c("WINTRY MIX|WINTER WEATHER MIX|WINTER WEATHER/MIX|ICY ROADS|FREEZING DRIZZLE|LIGHT SNOW","WINTER WEATHER"),
  c("^WINTER STORM","WINTER STORM"),
  c("^HEAVY SURF|ROUGH SURF","HIGH SURF"),
  c("RECORD HEAT|HEAT WAVE|EXTREME HEAT|RECORD/EXCESSIVE HEAT|^EXCESSIVE HEAT|^UNSEASONABLY WARM","HEAT"),
  c("LANDSLIDE|MUDSLIDE|MUD SLIDE","DEBRIS FLOW"),
  c("^HIGH WIND","HIGH WIND"),
  c("STREAM FLOOD|FLASH FLOO|^URBAN|FLOOD?FLASH|RIVER FLOOD","FLASH FLOOD"),
  c("^FLOOD|LAKESHORE FLOOD|MAJOR FLOOD","FLOOD"),
  c("^BLIZZARD","BLIZZARD"),
  c("HAIL","HAIL"),
  c("DENSE FOG|VOG","FOG"),
  c("DRYNESS|DRY |DROUGHT|EXCESSIVELY DRY","DROUGHT"),
  c("^HEAVY RAIN|EXCESSIVE RAINFALL|MIXED PRECIP|EXCESSIVE WETNESS|UNSEASONAL RAIN|^RAIN$","HEAVY RAIN"),
  c("^HEAVY SNOW|HEAVY WET SNOW|EXCESSIVE SNOW|RECORD SNOW","HEAVY SNOW"),
  c("^TROPICAL STORM","TROPICAL STORM"),
  c("^STORM SURGE$|HIGH SEAS|ROUGH SEAS","STORM SURGE/TIDE"),
  c("COASTAL FLOOD/EROSION|COASTAL STORM|EROSION/COASTAL FLOOD","COASTAL FLOOD"),
  c("LIGHTNING FIRE","LIGHTNING"),
  c("^HYPOTHERMIA/EXPOSURE$|NON-SEVERE WIND DAMAGE|MARINE MISHAP|GUSTY WIND|RAIN/SNOW|HEAVY MIX","OTHER")
)

to_clean <- toupper(as.character(storm.data$EVTYPE))
to_clean <- gsub("\\.$|^ {1,3}|[^A-Z0-9]{1,3}$","",to_clean)
to_clean <- gsub(" {2,20}"," ",to_clean)

for(i in 1:nrow(term_mapping)) { to_clean <- gsub(term_mapping[i,1],term_mapping[i,2],to_clean) }

for(i in 1:nrow(term_grouping)) { to_clean[grep(term_grouping[i,1], to_clean)] <- term_grouping[i,2] }

storm.data <- mutate(storm.data, EVTYPE_CLEAN = as.factor(to_clean))

After loading and cleaning, I trim the dataset by selected columns and I check the distribution of BGN_DATE. It looks like only a portion of the dataset goes back to 1950. I would like to operate on a time span that has data points for more event types. Based on the 1st quartile of the dataset I trim the dates to 1994.

storm.data.tidy <- select(storm.data, STATE, BGN_DATE, 
                                          EVTYPE_CLEAN, FATALITIES, INJURIES, 
                                          PROPDMG, CROPDMG,
                                          LATITUDE,LONGITUDE)
storm.data.tidy %>% 
  group_by(EVTYPE_CLEAN) %>% 
  summarize(minyear = min(year(BGN_DATE))) %>% 
  select(minyear) %>% 
  summary()

##     minyear    
##  Min.   :1950  
##  1st Qu.:1994  
##  Median :1996  
##  Mean   :1996  
##  3rd Qu.:1997  
##  Max.   :2007

storm.data.tidy <- filter(storm.data.tidy, year(BGN_DATE) > 1993)

After trimming, the tidy dataset looks as follows:

str(storm.data.tidy)

## Classes 'tbl_df', 'tbl' and 'data.frame':    702131 obs. of  9 variables:
##  $ STATE       : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ BGN_DATE    : Date, format: "1995-01-06" "1995-01-22" ...
##  $ EVTYPE_CLEAN: Factor w/ 330 levels "","ABNORMAL WARMTH",..: 67 86 67 86 86 103 278 278 280 79 ...
##  $ FATALITIES  : num  0 0 0 0 0 2 0 0 0 0 ...
##  $ INJURIES    : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ PROPDMG     : num  0e+00 0e+00 0e+00 0e+00 0e+00 1e+08 5e+04 5e+06 5e+05 0e+00 ...
##  $ CROPDMG     : num  0e+00 0e+00 0e+00 0e+00 0e+00 1e+07 0e+00 5e+05 0e+00 0e+00 ...
##  $ LATITUDE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LONGITUDE   : num  0 0 0 0 0 0 0 0 0 0 ...

Results

Overview of the total storm damage in time

First, I’m taking a look at the total population health and economic damage from storms in United States over considered time span (1994-2011).

health_sum <- storm.data.tidy %>% 
  group_by(year = year(BGN_DATE)) %>% 
  summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))
dmg_sum <- storm.data.tidy %>% 
  group_by(year = year(BGN_DATE)) %>% 
  summarize(PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG))
flat_table <- rbind(
  data.frame(year = health_sum$year, numbers = health_sum$FATALITIES, 
             type = "Fatalities", class = "Population health"),
  data.frame(year = health_sum$year, numbers = health_sum$INJURIES, 
             type = "Injuries", class = "Population health"),
  data.frame(year = dmg_sum$year, numbers = dmg_sum$PROPDMG/1000000, 
             type = "Property dmg", class = "Economic dmg (Mln $)"),
  data.frame(year = dmg_sum$year, numbers = dmg_sum$CROPDMG/1000000, 
             type = "Crop dmg", class = "Economic dmg (Mln $)")
)

ggplot(data = flat_table, aes(x = year, y = numbers, color = type)) + geom_point() + geom_line() + facet_grid(class ~ .,scales = "free") + ggtitle("Total health and economic damage in United States from storms")

From the figure above we can see that there is little correlation between economic and health damage. An Outstanding population health damage took place in years 1995, 1998, 2006 and 2011. An outstanding economic damage took place in years 2005 and 2006.

Events harmful for population health

Let’s summarize the events by their impact on the population health. I’m taking a look at the first 30 entries sorted descending according to fatalities and injuries.

health_sum_type <- storm.data.tidy %>% 
  group_by(Event = EVTYPE_CLEAN) %>% 
  summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
  arrange(desc(Fatalities,Injuries))
htmlTable(health_sum_type[1:30,], 
          align = c("r","c","c"), 
          css.cell = "padding-left: 1.5em; padding-right: 1.5em;")

	Event	Fatalities	Injuries
1	HEAT	3169	9236
2	TORNADO	1593	22593
3	FLASH FLOOD	1039	1852
4	LIGHTNING	795	5116
5	RIP CURRENT	577	529
6	COLD/WIND CHILL	456	320
7	FLOOD	455	6779
8	THUNDERSTORM WIND	439	6017
9	HIGH WIND	256	1259
10	AVALANCHE	224	170
11	WINTER STORM	196	1313
12	HIGH SURF	160	245
13	HEAVY SNOW	138	1083
14	HURRICANE	135	1332
15	STRONG WIND	131	386
16	HEAVY RAIN	102	288
17	FROST/FREEZE	98	2074
18	WILDFIRE	87	1456
19	FOG	75	1063
20	BLIZZARD	71	389
21	WINTER WEATHER	70	663
22	TROPICAL STORM	66	383
23	DEBRIS FLOW	44	55
24	TSUNAMI	33	129
25	STORM SURGE/TIDE	25	54
26	DUST STORM	22	439
27	OTHER	22	30
28	MARINE THUNDERSTORM WIND	19	34
29	MARINE STRONG WIND	14	22
30	HAIL	10	953

As we can see, heat caused the most fatalities, while tornadoes are responsible for the most injuries in the given time period.

I will plot only 20 most severe events, and their impact across United States. I will use the coordinates of the geographical middle of the US (42.87 Lat, -97.38 Long) to split data into “quarters”.

health_sum_loc <- storm.data.tidy %>%
  filter(as.character(EVTYPE_CLEAN) %in% as.character(health_sum_type[1:20,]$Event)) %>%
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
  mutate(Sh_Lat = ifelse(LATITUDE > 4287,"North","South"), 
         Sh_Long = ifelse(LONGITUDE > 9738,"West","East")) %>%
    mutate(Sh_Lat = factor(Sh_Lat, levels = c("North","South")), 
         Sh_Long = factor(Sh_Long,levels = c("West","East"))) %>%
  group_by(Event = EVTYPE_CLEAN, Sh_Lat, Sh_Long) %>%
  summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES))

ggplot(data = melt(health_sum_loc), aes(x = reorder(Event,value), y = value, fill = variable)) + geom_bar(position = position_dodge(), stat="identity") + facet_grid(Sh_Lat ~ Sh_Long) + coord_flip() + scale_y_continuous(name = "Number of accidents") + scale_x_discrete(name = "Names of events") + ggtitle(label = "Number of accicents affecting population health for the storm events with most impact, across United States")

## Using Event, Sh_Lat, Sh_Long as id variables

From the figure above we can see that the South East of the United States is the most affected area, although tornadoes are affecting also the South West and Nort East. The most harmful events are tornadoes, heat, thunderstorm winds and floods.

Events causing property damage

Let’s summarize the events by their impact on the economic damage. I’m taking a look at the first 30 entries sorted descending according to summarized property and crop damage. I scale the costs to millions of USD.

property_sum_type <- storm.data.tidy %>% 
  group_by(Event = EVTYPE_CLEAN) %>% 
  summarize(Property.dmg = sum(PROPDMG)/1000000, 
            Crop.dmg = sum(CROPDMG)/1000000) %>%
  arrange(desc(Property.dmg+Crop.dmg)) %>%
  mutate(Property.dmg = formatC(Property.dmg, format="f", digits=2, big.mark=","),
         Crop.dmg = formatC(Crop.dmg, format="f", digits=2, big.mark=","))
htmlTable(property_sum_type[1:30,],
          header = c("Event", "Property dmg, mln USD", "Crop dmg, mln USD"),
          align = c("r","c","c"), 
          css.cell = "padding-left: 5.5em; padding-right: 5.5em;")

	Event	Property dmg, mln USD	Crop dmg, mln USD
1	FLOOD	144,306.88	5,628.49
2	HURRICANE	85,300.91	5,515.62
3	STORM SURGE/TIDE	47,834.74	0.85
4	TORNADO	25,636.48	361.83
5	FLASH FLOOD	17,404.43	1,574.88
6	HAIL	15,580.34	3,003.63
7	DROUGHT	1,046.11	13,922.07
8	THUNDERSTORM WIND	10,746.66	1,263.35
9	FROST/FREEZE	3,869.21	6,913.17
10	TROPICAL STORM	7,713.89	694.39
11	WILDFIRE	7,877.56	403.28
12	HIGH WIND	5,416.09	668.30
13	HEAVY RAIN	3,224.99	943.51
14	WINTER STORM	1,620.10	16.94
15	COLD/WIND CHILL	129.73	1,426.77
16	HEAT	20.33	904.47
17	HEAVY SNOW	774.60	129.18
18	LIGHTNING	878.44	11.99
19	BLIZZARD	527.01	7.06
20	COASTAL FLOOD	438.05	0.06
21	DEBRIS FLOW	326.53	20.02
22	STRONG WIND	180.68	70.75
23	TSUNAMI	144.06	0.02
24	HIGH SURF	100.83	0.00
25	WINTER WEATHER	30.39	15.00
26	LAKE-EFFECT SNOW	40.12	0.00
27	FOG	22.22	0.00
28	ASTRONOMICAL HIGH TIDE	9.43	0.00
29	DUST STORM	5.54	3.10
30	WATERSPOUT	7.30	0.00

As we can see, floods and hurricanes are the most damaging events economically.

Again, I will plot only 20 most severe events, and their impact across United States. I will use the coordinates of the geographical middle of the US (42.87 Lat, -97.38 Long) to split data into “quarters”.

property_sum_loc <- storm.data.tidy %>%
  filter(as.character(EVTYPE_CLEAN) %in% as.character(property_sum_type[1:20,]$Event)) %>%
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
  mutate(Sh_Lat = ifelse(LATITUDE > 4287,"North","South"), 
         Sh_Long = ifelse(LONGITUDE > 9738,"West","East")) %>%
    mutate(Sh_Lat = factor(Sh_Lat, levels = c("North","South")), 
         Sh_Long = factor(Sh_Long,levels = c("West","East"))) %>%
  group_by(Event = EVTYPE_CLEAN, Sh_Lat, Sh_Long) %>%
  summarize(Property.dmg = sum(PROPDMG)/1000000, Crop.dmg = sum(CROPDMG)/1000000)

ggplot(data = melt(property_sum_loc), aes(x = reorder(Event,value), y = value, fill = variable)) + geom_bar(position = position_dodge(), stat="identity") + facet_grid(Sh_Lat ~ Sh_Long) + coord_flip() + scale_y_continuous(name = "Costs of damage in millions $") + scale_x_discrete(name = "Names of events") + ggtitle(label = "Costs of economic damage for the storm events with most impact, across United States")

## Using Event, Sh_Lat, Sh_Long as id variables

From the figure above we can see that the South East and South West of the United States are the most affected area. Flood in the South West generated the highest property costs across the United States in the analyzed period. In the South East hurricanes and storm surge/tide events generated the biggest economic burden. Interestingly, these events do not correlate well with the events affecting population health (tornadoes, heat), altohugh the top ten lists overlap.