The exploratory analysis below is performed on the dataset of NOAA Storm Database. The dataset, after an extensive cleaning, is used for three analyses. First, I take an overview of total impact on population health and economic damage by all storm-related events over time. Second, I identify the events most harmful for population health in United States. Finally, I identify the events causing the biggest economic damage across United States. I briefly discuss each figure.
Here is the setup for data operations and results, including the libraries:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(reshape2)
library(htmlTable)
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] htmlTable_1.9 reshape2_1.4.2 lubridate_1.6.0 ggplot2_2.2.1
## [5] dplyr_0.7.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.11 knitr_1.16 bindr_0.1 magrittr_1.5
## [5] munsell_0.4.3 colorspace_1.3-2 R6_2.2.2 rlang_0.1.1
## [9] plyr_1.8.4 stringr_1.2.0 tools_3.3.2 grid_3.3.2
## [13] checkmate_1.8.3 gtable_0.2.0 htmltools_0.3.6 lazyeval_0.2.0
## [17] yaml_2.1.14 assertthat_0.2.0 rprojroot_1.2 digest_0.6.12
## [21] tibble_1.3.3 bindrcpp_0.2 htmlwidgets_0.8 glue_1.1.1
## [25] evaluate_0.10 rmarkdown_1.6 stringi_1.1.5 scales_0.4.1
## [29] backports_1.1.0 pkgconfig_2.0.1
Here I’m downloading the storm data file, if it doesn’t exist.
if(!file.exists("Storm Data.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "Storm Data.csv.bz2")
}
Here I’m creating the table from the bz2 storm data file and store it in the cache file, if it doesn’t exist. Otherwise I load it from the cache. Because of the read.csv() capabilities, I can read the data file directly from the .bz2 format.
if(!file.exists("Storm Data.Rd")) {
storm.data <- tbl_df(read.csv("Storm Data.csv.bz2"))
save(storm.data, file = "Storm Data.Rd")
} else {
load("Storm Data.Rd")
}
Here I perform a small date and time cleaning for BGN/END DATE by changing them to a Date tyoe with the help of lubridate. A small helper function makes it easier to define a clean pipeline. I remove time and time zone information, it is redundant for this analysis.
format_NOAA_date <- function(dates) {
new.dates <- sapply(as.character(dates),
function(x) strsplit(x, split = " ")[[1]][1])
return(mdy(new.dates))
}
storm.data <- mutate(storm.data,
BGN_DATE = format_NOAA_date(BGN_DATE),
END_DATE = format_NOAA_date(END_DATE)) %>%
select(-BGN_TIME,-END_TIME,-TIME_ZONE)
Here’s the cleaning of PROPDMG and CROPDMG variables, as the number is actually split into two columns, with PROP/CROP_EXP columns containing the notation for hundreds (H), thousands (K), millions (M), billions (B), and explicing numeric encodings of the exponents. I remove the CROP/PROP_EXP columns as redundant afterwards.
format_NOAA_dmg <- function(dmgval,dmgexp) {
new.dmgval <- dmgval
this.dmgexp <- toupper(as.character(dmgexp))
this.dmgexp[this.dmgexp == "H"] <- "2"
this.dmgexp[this.dmgexp == "K"] <- "3"
this.dmgexp[this.dmgexp == "M"] <- "6"
this.dmgexp[this.dmgexp == "B"] <- "9"
this.dmgexp[this.dmgexp %in% c("+","-","","?")] <- "0"
new.dmgval <- dmgval*10^as.numeric(this.dmgexp)
return(new.dmgval)
}
storm.data <- mutate(storm.data,
PROPDMG = format_NOAA_dmg(PROPDMG,PROPDMGEXP),
CROPDMG = format_NOAA_dmg(CROPDMG,CROPDMGEXP)) %>%
select(-PROPDMGEXP,-CROPDMGEXP)
I need to clean EVTYPE codes, as they are quite a mess. This involves: 1) correction/removal of special characters (e.g. starting or trailing spaces) and converting all terms to uppercase 2) unification of terms (e.g. TSTM equals THUNDERSTORM); I use a defined term_mapping table 3) cleaning related types by grouping, as there are many versions of the same type (e.g. THUNDERSTORM, THUNDERSTORM WINDS, THUNDERSTORMS); the grouing is informed by the documentation for the dataset. I use a defined term_grouping table.
term_mapping <- rbind(
c("TSTM","THUNDERSTORM"),
c("THUNDESTORM|THUNERSTORM|THUNDERTORM|THUNDERTSORM|THUNDERESTORM|THUNDERSTROM","THUNDERSTORM"),
c("WND","WIND"),
c("WINDSS","WIND"),
c("WINDS","WIND"),
c("WINS","WIND"),
c("CSTL", "COASTAL"),
c("AVALANCE", "AVALANCHE"),
c("HVY", "HEAVY"),
c("EROSIN", "EROSION"),
c("FLD|FLOODG|FLOODING", "FLOOD"),
c("PRECIPITATION|PRECIPATATION|SHOWERS|SHOWER", "RAIN")
)
term_grouping <- rbind(
c("^THUNDERSTORM|THUNDEERSTORM|THUNDERSNOW|MICROBURST|SEVERE THUNDERSTORM","THUNDERSTORM WIND"),
c("^HURRICANE|TYPHOON","HURRICANE"),
c("TORNADO","TORNADO"),
c("^RIP CURRENT","RIP CURRENT"),
c("^SNOW","HEAVY SNOW"),
c("^WIND$","STRONG WIND"),
c("^ICE|^FREEZING RAIN|BLACK ICE|FROST|FREEZE|DAMAGING FREEZE|COOL AND WET","FROST/FREEZE"),
c("^GLAZE","FREEZING FOG"),
c("FOREST FIRE|WILDFIRES|WILD FIRES","WILDFIRE"),
c("COLD|EXTREME WINDCHILL|LOW TEMPERATURE","COLD/WIND CHILL"),
c("WINTRY MIX|WINTER WEATHER MIX|WINTER WEATHER/MIX|ICY ROADS|FREEZING DRIZZLE|LIGHT SNOW","WINTER WEATHER"),
c("^WINTER STORM","WINTER STORM"),
c("^HEAVY SURF|ROUGH SURF","HIGH SURF"),
c("RECORD HEAT|HEAT WAVE|EXTREME HEAT|RECORD/EXCESSIVE HEAT|^EXCESSIVE HEAT|^UNSEASONABLY WARM","HEAT"),
c("LANDSLIDE|MUDSLIDE|MUD SLIDE","DEBRIS FLOW"),
c("^HIGH WIND","HIGH WIND"),
c("STREAM FLOOD|FLASH FLOO|^URBAN|FLOOD?FLASH|RIVER FLOOD","FLASH FLOOD"),
c("^FLOOD|LAKESHORE FLOOD|MAJOR FLOOD","FLOOD"),
c("^BLIZZARD","BLIZZARD"),
c("HAIL","HAIL"),
c("DENSE FOG|VOG","FOG"),
c("DRYNESS|DRY |DROUGHT|EXCESSIVELY DRY","DROUGHT"),
c("^HEAVY RAIN|EXCESSIVE RAINFALL|MIXED PRECIP|EXCESSIVE WETNESS|UNSEASONAL RAIN|^RAIN$","HEAVY RAIN"),
c("^HEAVY SNOW|HEAVY WET SNOW|EXCESSIVE SNOW|RECORD SNOW","HEAVY SNOW"),
c("^TROPICAL STORM","TROPICAL STORM"),
c("^STORM SURGE$|HIGH SEAS|ROUGH SEAS","STORM SURGE/TIDE"),
c("COASTAL FLOOD/EROSION|COASTAL STORM|EROSION/COASTAL FLOOD","COASTAL FLOOD"),
c("LIGHTNING FIRE","LIGHTNING"),
c("^HYPOTHERMIA/EXPOSURE$|NON-SEVERE WIND DAMAGE|MARINE MISHAP|GUSTY WIND|RAIN/SNOW|HEAVY MIX","OTHER")
)
to_clean <- toupper(as.character(storm.data$EVTYPE))
to_clean <- gsub("\\.$|^ {1,3}|[^A-Z0-9]{1,3}$","",to_clean)
to_clean <- gsub(" {2,20}"," ",to_clean)
for(i in 1:nrow(term_mapping)) { to_clean <- gsub(term_mapping[i,1],term_mapping[i,2],to_clean) }
for(i in 1:nrow(term_grouping)) { to_clean[grep(term_grouping[i,1], to_clean)] <- term_grouping[i,2] }
storm.data <- mutate(storm.data, EVTYPE_CLEAN = as.factor(to_clean))
After loading and cleaning, I trim the dataset by selected columns and I check the distribution of BGN_DATE. It looks like only a portion of the dataset goes back to 1950. I would like to operate on a time span that has data points for more event types. Based on the 1st quartile of the dataset I trim the dates to 1994.
storm.data.tidy <- select(storm.data, STATE, BGN_DATE,
EVTYPE_CLEAN, FATALITIES, INJURIES,
PROPDMG, CROPDMG,
LATITUDE,LONGITUDE)
storm.data.tidy %>%
group_by(EVTYPE_CLEAN) %>%
summarize(minyear = min(year(BGN_DATE))) %>%
select(minyear) %>%
summary()
## minyear
## Min. :1950
## 1st Qu.:1994
## Median :1996
## Mean :1996
## 3rd Qu.:1997
## Max. :2007
storm.data.tidy <- filter(storm.data.tidy, year(BGN_DATE) > 1993)
After trimming, the tidy dataset looks as follows:
str(storm.data.tidy)
## Classes 'tbl_df', 'tbl' and 'data.frame': 702131 obs. of 9 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ BGN_DATE : Date, format: "1995-01-06" "1995-01-22" ...
## $ EVTYPE_CLEAN: Factor w/ 330 levels "","ABNORMAL WARMTH",..: 67 86 67 86 86 103 278 278 280 79 ...
## $ FATALITIES : num 0 0 0 0 0 2 0 0 0 0 ...
## $ INJURIES : num 0 0 2 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0e+00 0e+00 0e+00 0e+00 0e+00 1e+08 5e+04 5e+06 5e+05 0e+00 ...
## $ CROPDMG : num 0e+00 0e+00 0e+00 0e+00 0e+00 1e+07 0e+00 5e+05 0e+00 0e+00 ...
## $ LATITUDE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LONGITUDE : num 0 0 0 0 0 0 0 0 0 0 ...
First, I’m taking a look at the total population health and economic damage from storms in United States over considered time span (1994-2011).
health_sum <- storm.data.tidy %>%
group_by(year = year(BGN_DATE)) %>%
summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))
dmg_sum <- storm.data.tidy %>%
group_by(year = year(BGN_DATE)) %>%
summarize(PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG))
flat_table <- rbind(
data.frame(year = health_sum$year, numbers = health_sum$FATALITIES,
type = "Fatalities", class = "Population health"),
data.frame(year = health_sum$year, numbers = health_sum$INJURIES,
type = "Injuries", class = "Population health"),
data.frame(year = dmg_sum$year, numbers = dmg_sum$PROPDMG/1000000,
type = "Property dmg", class = "Economic dmg (Mln $)"),
data.frame(year = dmg_sum$year, numbers = dmg_sum$CROPDMG/1000000,
type = "Crop dmg", class = "Economic dmg (Mln $)")
)
ggplot(data = flat_table, aes(x = year, y = numbers, color = type)) + geom_point() + geom_line() + facet_grid(class ~ .,scales = "free") + ggtitle("Total health and economic damage in United States from storms")
From the figure above we can see that there is little correlation between economic and health damage. An Outstanding population health damage took place in years 1995, 1998, 2006 and 2011. An outstanding economic damage took place in years 2005 and 2006.
Let’s summarize the events by their impact on the population health. I’m taking a look at the first 30 entries sorted descending according to fatalities and injuries.
health_sum_type <- storm.data.tidy %>%
group_by(Event = EVTYPE_CLEAN) %>%
summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
arrange(desc(Fatalities,Injuries))
htmlTable(health_sum_type[1:30,],
align = c("r","c","c"),
css.cell = "padding-left: 1.5em; padding-right: 1.5em;")
| Event | Fatalities | Injuries | |
|---|---|---|---|
| 1 | HEAT | 3169 | 9236 |
| 2 | TORNADO | 1593 | 22593 |
| 3 | FLASH FLOOD | 1039 | 1852 |
| 4 | LIGHTNING | 795 | 5116 |
| 5 | RIP CURRENT | 577 | 529 |
| 6 | COLD/WIND CHILL | 456 | 320 |
| 7 | FLOOD | 455 | 6779 |
| 8 | THUNDERSTORM WIND | 439 | 6017 |
| 9 | HIGH WIND | 256 | 1259 |
| 10 | AVALANCHE | 224 | 170 |
| 11 | WINTER STORM | 196 | 1313 |
| 12 | HIGH SURF | 160 | 245 |
| 13 | HEAVY SNOW | 138 | 1083 |
| 14 | HURRICANE | 135 | 1332 |
| 15 | STRONG WIND | 131 | 386 |
| 16 | HEAVY RAIN | 102 | 288 |
| 17 | FROST/FREEZE | 98 | 2074 |
| 18 | WILDFIRE | 87 | 1456 |
| 19 | FOG | 75 | 1063 |
| 20 | BLIZZARD | 71 | 389 |
| 21 | WINTER WEATHER | 70 | 663 |
| 22 | TROPICAL STORM | 66 | 383 |
| 23 | DEBRIS FLOW | 44 | 55 |
| 24 | TSUNAMI | 33 | 129 |
| 25 | STORM SURGE/TIDE | 25 | 54 |
| 26 | DUST STORM | 22 | 439 |
| 27 | OTHER | 22 | 30 |
| 28 | MARINE THUNDERSTORM WIND | 19 | 34 |
| 29 | MARINE STRONG WIND | 14 | 22 |
| 30 | HAIL | 10 | 953 |
As we can see, heat caused the most fatalities, while tornadoes are responsible for the most injuries in the given time period.
I will plot only 20 most severe events, and their impact across United States. I will use the coordinates of the geographical middle of the US (42.87 Lat, -97.38 Long) to split data into “quarters”.
health_sum_loc <- storm.data.tidy %>%
filter(as.character(EVTYPE_CLEAN) %in% as.character(health_sum_type[1:20,]$Event)) %>%
filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
mutate(Sh_Lat = ifelse(LATITUDE > 4287,"North","South"),
Sh_Long = ifelse(LONGITUDE > 9738,"West","East")) %>%
mutate(Sh_Lat = factor(Sh_Lat, levels = c("North","South")),
Sh_Long = factor(Sh_Long,levels = c("West","East"))) %>%
group_by(Event = EVTYPE_CLEAN, Sh_Lat, Sh_Long) %>%
summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES))
ggplot(data = melt(health_sum_loc), aes(x = reorder(Event,value), y = value, fill = variable)) + geom_bar(position = position_dodge(), stat="identity") + facet_grid(Sh_Lat ~ Sh_Long) + coord_flip() + scale_y_continuous(name = "Number of accidents") + scale_x_discrete(name = "Names of events") + ggtitle(label = "Number of accicents affecting population health for the storm events with most impact, across United States")
## Using Event, Sh_Lat, Sh_Long as id variables
From the figure above we can see that the South East of the United States is the most affected area, although tornadoes are affecting also the South West and Nort East. The most harmful events are tornadoes, heat, thunderstorm winds and floods.
Let’s summarize the events by their impact on the economic damage. I’m taking a look at the first 30 entries sorted descending according to summarized property and crop damage. I scale the costs to millions of USD.
property_sum_type <- storm.data.tidy %>%
group_by(Event = EVTYPE_CLEAN) %>%
summarize(Property.dmg = sum(PROPDMG)/1000000,
Crop.dmg = sum(CROPDMG)/1000000) %>%
arrange(desc(Property.dmg+Crop.dmg)) %>%
mutate(Property.dmg = formatC(Property.dmg, format="f", digits=2, big.mark=","),
Crop.dmg = formatC(Crop.dmg, format="f", digits=2, big.mark=","))
htmlTable(property_sum_type[1:30,],
header = c("Event", "Property dmg, mln USD", "Crop dmg, mln USD"),
align = c("r","c","c"),
css.cell = "padding-left: 5.5em; padding-right: 5.5em;")
| Event | Property dmg, mln USD | Crop dmg, mln USD | |
|---|---|---|---|
| 1 | FLOOD | 144,306.88 | 5,628.49 |
| 2 | HURRICANE | 85,300.91 | 5,515.62 |
| 3 | STORM SURGE/TIDE | 47,834.74 | 0.85 |
| 4 | TORNADO | 25,636.48 | 361.83 |
| 5 | FLASH FLOOD | 17,404.43 | 1,574.88 |
| 6 | HAIL | 15,580.34 | 3,003.63 |
| 7 | DROUGHT | 1,046.11 | 13,922.07 |
| 8 | THUNDERSTORM WIND | 10,746.66 | 1,263.35 |
| 9 | FROST/FREEZE | 3,869.21 | 6,913.17 |
| 10 | TROPICAL STORM | 7,713.89 | 694.39 |
| 11 | WILDFIRE | 7,877.56 | 403.28 |
| 12 | HIGH WIND | 5,416.09 | 668.30 |
| 13 | HEAVY RAIN | 3,224.99 | 943.51 |
| 14 | WINTER STORM | 1,620.10 | 16.94 |
| 15 | COLD/WIND CHILL | 129.73 | 1,426.77 |
| 16 | HEAT | 20.33 | 904.47 |
| 17 | HEAVY SNOW | 774.60 | 129.18 |
| 18 | LIGHTNING | 878.44 | 11.99 |
| 19 | BLIZZARD | 527.01 | 7.06 |
| 20 | COASTAL FLOOD | 438.05 | 0.06 |
| 21 | DEBRIS FLOW | 326.53 | 20.02 |
| 22 | STRONG WIND | 180.68 | 70.75 |
| 23 | TSUNAMI | 144.06 | 0.02 |
| 24 | HIGH SURF | 100.83 | 0.00 |
| 25 | WINTER WEATHER | 30.39 | 15.00 |
| 26 | LAKE-EFFECT SNOW | 40.12 | 0.00 |
| 27 | FOG | 22.22 | 0.00 |
| 28 | ASTRONOMICAL HIGH TIDE | 9.43 | 0.00 |
| 29 | DUST STORM | 5.54 | 3.10 |
| 30 | WATERSPOUT | 7.30 | 0.00 |
As we can see, floods and hurricanes are the most damaging events economically.
Again, I will plot only 20 most severe events, and their impact across United States. I will use the coordinates of the geographical middle of the US (42.87 Lat, -97.38 Long) to split data into “quarters”.
property_sum_loc <- storm.data.tidy %>%
filter(as.character(EVTYPE_CLEAN) %in% as.character(property_sum_type[1:20,]$Event)) %>%
filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
mutate(Sh_Lat = ifelse(LATITUDE > 4287,"North","South"),
Sh_Long = ifelse(LONGITUDE > 9738,"West","East")) %>%
mutate(Sh_Lat = factor(Sh_Lat, levels = c("North","South")),
Sh_Long = factor(Sh_Long,levels = c("West","East"))) %>%
group_by(Event = EVTYPE_CLEAN, Sh_Lat, Sh_Long) %>%
summarize(Property.dmg = sum(PROPDMG)/1000000, Crop.dmg = sum(CROPDMG)/1000000)
ggplot(data = melt(property_sum_loc), aes(x = reorder(Event,value), y = value, fill = variable)) + geom_bar(position = position_dodge(), stat="identity") + facet_grid(Sh_Lat ~ Sh_Long) + coord_flip() + scale_y_continuous(name = "Costs of damage in millions $") + scale_x_discrete(name = "Names of events") + ggtitle(label = "Costs of economic damage for the storm events with most impact, across United States")
## Using Event, Sh_Lat, Sh_Long as id variables
From the figure above we can see that the South East and South West of the United States are the most affected area. Flood in the South West generated the highest property costs across the United States in the analyzed period. In the South East hurricanes and storm surge/tide events generated the biggest economic burden. Interestingly, these events do not correlate well with the events affecting population health (tornadoes, heat), altohugh the top ten lists overlap.