In this report we look at data from the National Oceanographic & Atmospheric Administration (NOAA) storm data this data analysis intends to answer these questions:
To answer these questions we standardized and summarized the data to compare the impact of the various storm event types. We found that excessive heat and tornadoes are most harmful to human life and that hurricanes and storm tides have the greatest economic impact.
Storm event data is provided by the NOAA at this location. The combined event data starting from 1950 is included in the zip file used in the code below.
48 event types have been used from 1996 to present are provided by the NOAA at this location. The event type values provided in a Word document have been stored in a .csv file for use here.Prior to 1996 the only three event types used were “Tornado”, “Thunderstorm Wind” and “Hail”. A larger set of possible values might have changed the distribution events by event type over this period.
Reading the compressed zip file - this includes data from 1950:
setwd("C:/Users/IBM_ADMIN/Documents/Self/Training/JohnsHopkinsDataScience/05 - Reproducable Research/Prog 2")
dt <- read.csv("repdata_data_StormData.csv.bz2", stringsAsFactors = FALSE, strip.white = TRUE)
dim(dt)
## [1] 902297 37
str(dt)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Both questions focus on the event type determined for each storm record in the database. A quick check confirms there are no missing values for this variable but that the number of event types used in the data is much larger than the official 48 event types.
sum(is.na(dt$EVTYPE))
## [1] 0
length(table(dt$EVTYPE))
## [1] 985
Build a vector of the 48 official event types can be built from the source identified earlier:
evType <- read.csv("eventtypes.csv", header=FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
names(evType) <- c("eventType", "zone")
#Take out leading blanks and upshift
trim.leading <- function (x) sub("^\\s+", "", x)
ev <- toupper(trim.leading(evType$eventType))
Both property and crop damage are represented by and amount value and a scalar code such as “K” meaning thousand or “8” meaning multiply by ten to the eighth power. We compute the amount for property and crop damage: $^8 $
#Extract the year
dt$dat <- as.Date(dt$BGN_DATE, '%m/%d/%Y')
dt$yr <- year(dt$dat)
dt$PROPDMGEXP <- toupper(dt$PROPDMGEXP)
dt$CROPDMGEXP <- toupper(dt$CROPDMGEXP)
dt$propExp <- 1
dt$propExp [dt$PROPDMGEXP == "1"] <- 10
dt$propExp [dt$PROPDMGEXP == "2"] <- 100
dt$propExp [dt$PROPDMGEXP == "3"] <- 1000
dt$propExp [dt$PROPDMGEXP == "4"] <- 10000
dt$propExp [dt$PROPDMGEXP == "5"] <- 100000
dt$propExp [dt$PROPDMGEXP == "6"] <- 1000000
dt$propExp [dt$PROPDMGEXP == "7"] <- 10000000
dt$propExp [dt$PROPDMGEXP == "8"] <- 100000000
dt$propExp [dt$PROPDMGEXP == "B"] <- 1000000000
dt$propExp [dt$PROPDMGEXP == "H"] <- 100
dt$propExp [dt$PROPDMGEXP == "K"] <- 1000
dt$propExp [dt$PROPDMGEXP == "M"] <- 1000000
dt$propDamage <- dt$PROPDMG * dt$propExp
dt$cropExp <- 1
dt$cropExp [dt$CROPDMGEXP == "2"] <- 100
dt$cropExp [dt$CROPDMGEXP == "K"] <- 1000
dt$cropExp [dt$CROPDMGEXP == "M"] <- 1000000
dt$cropExp [dt$CROPDMGEXP == "B"] <- 1000000000
dt$cropDamage <- dt$CROPDMG * dt$cropExp
dt$damage <- dt$propDamage + dt$cropDamage
Looking at the US Dollar impact of storm damage
summary(dt$propDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 4.746e+05 5.000e+02 1.150e+11
summary(dt$cropDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 5.442e+04 0.000e+00 5.000e+09
There are a few event types which describe themselves as “summary” - these won’t match any of the event types so we remove them. From 1950 to 1995 there were only three event types used (sometimes not even all three in a given year) so we won’t learn enough from this earlier period - these are also removed.
dt$event <- toupper(trim.leading(dt$EVTYPE))
dt$mevent <- ""
#remove"Summary" event types - they won't match any event type
dt <- dt [!grepl("^SUMMARY", dt$event), ]
#remove data prior to 1996 - we won't learn about different event types
dt <- dt [dt$yr >= 1996, ]
The remaining more than 900 event type values can be largely matched using a character matching routine in R named “amatch”.
#match the event type from the database to the official event type list (amatch)
dt$ematch <- amatch(dt$event, ev, nomatch=0, maxDist = 6)
dt$mevent [dt$ematch > 0] <- ev[dt$ematch]
length(dt$ematch [dt$ematch != 0]) / length(dt$ematch)
## [1] 0.9873411
length(dt$mevent [dt$mevent != ""]) / length(dt$mevent)
## [1] 0.9873411
How well did the automated matching work? After a few trials with the maxDist parameter, the matching rate seems acceptable. More than 98% of the event types are matched. This approach has several advantages over observing patterns and manually creating string manipulation commands. Event types from the database still unmatched are combined in a synthetic category for “all other”.
#calculate % of unmatched damage
sum(dt$damage [dt$mevent == ""]) / sum(dt$damage)
## [1] 0.01241253
#calculate % of unmatched fatality
sum(dt$FATALITIES [dt$mevent == ""]) / sum(dt$FATALITIES)
## [1] 0.02874485
#calculate % of unmatched injury
sum(dt$INJURIES [dt$mevent == ""]) / sum(dt$INJURIES)
## [1] 0.0189392
#how many records remain unmatched?
length(table(dt$event [dt$mevent == ""]))
## [1] 197
#Mark event types not matched as 'all other'
dt$mevent[dt$mevent == ""] <- "ALL-OTHER-TYPES"
The data can be then summarized by event type and the highest impact even types for fatality, injury, property damage and crop damage can be extracted (separate sets of event types for each of the four measures):
nt <- select(dt, mevent, damage, propDamage, cropDamage, FATALITIES, INJURIES)
gt <- group_by(nt, mevent)
dsumm <- summarise(group_by(gt, mevent),
sum(damage),
sum(propDamage),
sum(cropDamage),
sum(FATALITIES),
sum(INJURIES)
)
names(dsumm) <- c("evType", "damage", "propDamage", "cropDamage", "fatalities", "injuries")
sl <- 6
#Select the top property damage event type
dsumm <- arrange(dsumm, desc(propDamage))
sumProp <- data.frame(typ="Property", level=dsumm$propDamage[1:sl], evType=dsumm$evType[1:sl])
#and crop damage event type
dsumm <- arrange(dsumm, desc(cropDamage))
sumCrop <- data.frame(typ="Crops", level=dsumm$cropDamage[1:sl], evType=dsumm$evType[1:sl])
#and top event type for fatalities
dsumm<- arrange(dsumm, desc(fatalities))
sumFatl <- data.frame(typ="Fatalities",level=dsumm$fatalities[1:sl], evType=dsumm$evType[1:sl])
#and injuries
dsumm<- arrange(dsumm, desc(injuries))
sumInju <- data.frame(typ="Injuries",level=dsumm$injuries[1:sl], evType=dsumm$evType[1:sl])
d10 <- rbind(sumFatl, sumInju, sumProp, sumCrop)
In addition to looking at the summary over all years it might be instructive to summarize the damage data by year to view impact of event types varies over time. Here we select event types if they were in the top 10% of damage in any given year:
ny <- select(dt, mevent, yr, damage, propDamage, cropDamage, FATALITIES, INJURIES)
gy <- group_by(ny, yr, mevent)
dsumy <- summarise(group_by(gy, yr, mevent),
sum(damage),
sum(propDamage),
sum(cropDamage),
sum(FATALITIES),
sum(INJURIES)
)
names(dsumy) <- c("year", "evType", "damage", "propDamage", "cropDamage", "fatalities", "injuries")
#Select data for event types if they were in the top 10% in any year
sumyFtype <- dsumy$evType [dsumy$fatalities > quantile(dsumy$fatalities, .95)]
sumyFatalities <- dsumy [(dsumy$evType %in% sumyFtype), c("year", "evType", "fatalities")]
#Select data for event types if they were in the top 10% in any year
sumyDtype <- dsumy$evType [dsumy$damage > quantile(dsumy$damage, .95)]
sumyDamage <- dsumy [(dsumy$evType %in% sumyDtype), c("year", "evType", "damage")]
From the summarized data across all years we can see the highest impact event types across all years.
ggplot(d10, aes(x=evType, y=level)) +
geom_bar(stat="identity") +
facet_wrap(~typ, scales="free", ncol=1) +
ggtitle("Fatalities, Injuries, Property Damage and Crop Damage (1996-2011)") +
xlab("Highest impact event types - Property and Crop damage in US Dollars")
Fatalities from storm events can be seen over time:
ggplot(sumyFatalities, aes(x=year, y=fatalities, group=evType, colour=evType, fill=evType)) +
geom_bar(stat="identity", position="dodge") +
ggtitle("Fatalities from Storms by top event types (1996-2011)") +
ylab("Number of Fatalities")
The damage by event type can be seen over time:
ggplot(sumyDamage, aes(x=year, y=damage, group=evType, colour=evType, fill=evType)) +
geom_bar(stat="identity", position="dodge") +
ggtitle("Property + Crop damage produced by top event types (1996-2011)") +
ylab("Damage in US Dollars")
Thanks to Robert Carman for the suggestion in the discussion forum to use the amatch routine for event type matching. After trying a few different parameters this approach seemed to be effective and reproduceable.
My session info:
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.0.0 dplyr_0.4.3 lubridate_1.5.0 stringdist_0.9.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.2 digest_0.6.8 assertthat_0.1 plyr_1.8.3
## [5] grid_3.2.1 R6_2.1.1 gtable_0.1.2 DBI_0.3.1
## [9] formatR_1.2.1 magrittr_1.5 scales_0.3.0 evaluate_0.8
## [13] stringi_1.0-1 lazyeval_0.1.10 rmarkdown_0.8.1 labeling_0.3
## [17] tools_3.2.1 stringr_1.0.0 munsell_0.4.2 yaml_2.1.13
## [21] parallel_3.2.1 colorspace_1.2-6 htmltools_0.2.6 knitr_1.11
```