The purpose of this analysis is to look at data from the NOAA Storm Database from years 1950-2011 and analyze two questions of interest:
To answer these questions, I will organize the data from the database. To find which events are most harmful with respect to health, I will look at which types of events have historically caused the most fatalities. To find which events cause the greatest economic consequences, I will look at which events have historically caused the greatest monetary loss to crop damage and property damage.
I will be using the reshape2, ggplot2, dplyr, and knitr libraries throughout this analysis.
#Load necessary libraries to be used
library(knitr)
library(reshape2)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First, I read in the data and take a look at it.
df <- read.csv("repdata_data_StormData.csv")
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
It turns out the event type classifications are messy and inconsistent as shown below:
unique(df$EVTYPE)[1:20]
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
We can see that a lot of events are categorized with different and sometimes incorrect spellings or capitalization, so need to be cleaned up. I have done some cleaning up shown by the code below:
#Cleaning up of EVTYPE
winterstorm <- c("icy", "ice", "snow", "wint", "hail", "freez", "frost",
"blizzard", "glaze", "sleet")
coastalstorm <- c("coastal", "surf", "waves", "current", "tsunami", "waterspout",
"high", "seas", "tide", "surge", "water", "wave", "typhoon")
marine <- c("marine", "drowning")
rainstorm <- c("rain", "precip")
cold <- c("cold", "hypothermia", "hyperthermia", "low temperature")
flooding <- c("floo", "fld")
df$EVTYPE <- as.character(df$EVTYPE)
df$EVTYPE[grepl("aval", df$EVTYPE, ignore.case=TRUE)] <- "Avalanche"
df$EVTYPE[grepl(paste(cold, collapse = "|"),
df$EVTYPE, ignore.case=TRUE)] <- "Cold"
df$EVTYPE[grepl("drough", df$EVTYPE, ignore.case=TRUE)] <- "Drought"
df$EVTYPE[grepl("dust", df$EVTYPE, ignore.case=TRUE)] <- "Dust Related"
df$EVTYPE[grepl(paste(flooding, collapse = "|"),
df$EVTYPE, ignore.case=TRUE)] <- "Flooding"
df$EVTYPE[grepl("fire", df$EVTYPE, ignore.case=TRUE)] <- "Fire/Wildfire"
df$EVTYPE[grepl("fog", df$EVTYPE, ignore.case=TRUE)] <- "Fog"
df$EVTYPE[grepl("heat|warm", df$EVTYPE, ignore.case=TRUE)] <- "Heat"
df$EVTYPE[grepl("mudslide", df$EVTYPE, ignore.case=TRUE)] <- "Mudslide"
df$EVTYPE[grepl("landslide", df$EVTYPE, ignore.case=TRUE)] <- "Landslide"
df$EVTYPE[grepl("lightning|thunder", df$EVTYPE, ignore.case=TRUE)] <- "Lightning"
df$EVTYPE[grepl("torn", df$EVTYPE, ignore.case=TRUE)] <- "Tornado"
df$EVTYPE[grepl(paste(winterstorm, collapse = "|"),
df$EVTYPE, ignore.case=TRUE)] <- "Winter Storm"
df$EVTYPE[grepl(paste(rainstorm, collapse = "|"),
df$EVTYPE, ignore.case=TRUE)] <- "Rain storm"
df$EVTYPE[grepl("wind|microburst|funnel", df$EVTYPE, ignore.case=TRUE)] <- "Winds"
df$EVTYPE[grepl("hurric", df$EVTYPE, ignore.case=TRUE)] <- "Hurricane"
df$EVTYPE[grepl("tropical", df$EVTYPE, ignore.case=TRUE)] <- "Tropical Storm"
df$EVTYPE[grepl(paste(coastalstorm, collapse = "|"),
df$EVTYPE, ignore.case=TRUE)] <- "Coastal Storm"
df$EVTYPE[grepl(paste(marine, collapse = "|"),
df$EVTYPE, ignore.case=TRUE)] <- "Marine Related"
The second messy part of this data set is related to the economic figures. The table contains two “EXP” columns which indicates a power of 10 multiplier for each of DMG values. There are some odd characters here, which need to be cleaned up
unique(df$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
The numeric values represent powers of 10. It should follow that the values h = hundreds, k = thousands, m = millions, b = billions. There is no way to know the values of -, ?, + so I will assign them values of 0.
#Clean PROPDMGEXP
df$PROPDMGEXP <- as.character(df$PROPDMGEXP)
df$PROPDMGEXP[df$PROPDMGEXP == "" | df$PROPDMGEXP == "+" | df$PROPDMGEXP == "-" | df$PROPDMGEXP == "?"] <- 0
df$PROPDMGEXP[df$PROPDMGEXP == "H" | df$PROPDMGEXP == "h"] <- 2
df$PROPDMGEXP[df$PROPDMGEXP == "K"] <- 3
df$PROPDMGEXP[df$PROPDMGEXP == "M" | df$PROPDMGEXP == "m"] <- 6
df$PROPDMGEXP[df$PROPDMGEXP == "B"] <- 9
df$PROPDMGEXP <- as.numeric(df$PROPDMGEXP)
unique(df$PROPDMGEXP)
## [1] 3 6 0 9 5 4 2 7 1 8
#Clean CROPDMGEXP
df$CROPDMGEXP <- as.character(df$CROPDMGEXP)
df$CROPDMGEXP[df$CROPDMGEXP == "?" | df$CROPDMGEXP == ""] <- 0
df$CROPDMGEXP[df$CROPDMGEXP == "K" | df$CROPDMGEXP == "k"] <- 3
df$CROPDMGEXP[df$CROPDMGEXP == "M" | df$CROPDMGEXP == "m"] <- 6
df$CROPDMGEXP[df$CROPDMGEXP == "B"] <- 9
df$CROPDMGEXP <- as.numeric(df$CROPDMGEXP)
unique(df$CROPDMGEXP)
## [1] 0 6 3 9 2
For our first question of interest, we are only concerned with the data with health consequences. So I have subsetted the data by those storms that caused any fatalities or injuries and aggregated the sums for each event type:
health <- subset(df, FATALITIES > 0 | INJURIES > 0)
X <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = health, FUN = sum)
names(X) <- c("EventType", "Fatalities", "Injuries")
X <- arrange(X, desc(Fatalities + Injuries))[1:10,]
Here is a table of the top 10 events with the highest counts related to health consequences:
print(X)
## EventType Fatalities Injuries
## 1 Tornado 5661 91407
## 2 Heat 3172 9228
## 3 Flooding 1553 8683
## 4 Winds 986 8897
## 5 Lightning 1029 7711
## 6 Winter Storm 676 7843
## 7 Coastal Storm 821 999
## 8 Fire/Wildfire 90 1608
## 9 Hurricane 133 1328
## 10 Fog 80 1076
It may be useful to visualize the data as well:
healthmelted <- melt(X, id = "EventType")
ggplot(data = healthmelted, aes(x = EventType, y = value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set1") +
ggtitle("Harmful US Storms 1950-2011") +
coord_flip()
For our second question of interest, we are only concerned with the data causing economic damage. I subsetted the data by those storms that caused any damage.
econ <- select(df, c(EVTYPE, contains("DMG")))
econ <- subset(econ, PROPDMG > 0 | CROPDMG > 0)
Now I aggregate the sums for each event type (in billions) and we can see this data in a table:
Y <- aggregate(cbind(CROPDMG*10^CROPDMGEXP, PROPDMG*10^PROPDMGEXP) ~ EVTYPE, data = econ, FUN = sum)
Y[,2:3] <- Y[,2:3]/1000000000
names(Y) <- c("EventType", "CropDamage", "PropertyDamage")
Y <- arrange(Y, desc(CropDamage + PropertyDamage))[1:10,]
print(Y)
## EventType CropDamage PropertyDamage
## 1 Flooding 12.3885972 168.270525
## 2 Hurricane 5.5052928 84.656180
## 3 Tornado 0.4174615 58.603318
## 4 Coastal Storm 0.0017000 48.831805
## 5 Winter Storm 10.4249263 28.469780
## 6 Drought 13.9726218 1.046306
## 7 Winds 1.3313427 10.679604
## 8 Fire/Wildfire 0.4032816 8.501629
## 9 Tropical Storm 0.6948960 7.716128
## 10 Lightning 0.6650675 7.576885
And to visualize the data:
econmelted <- melt(Y, id = "EventType")
ggplot(data = econmelted, aes(x = EventType, y = value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set1") +
ggtitle("Economic Damage (Billions) from US Storms 1950-2011") +
coord_flip()
I think the data organized here paints some clear pictures that will be useful for government officials to use as they prioritize resources for different types of future events.