Storm related event destroy life and damage properties. It is very important to analize them and understand them in order to take preventive measure and allocate scarce funding. Using the National Climatic Data Center Storm Events database it is possible to see the impact of storm events and how this impact is changing with time.
Anayisis of storm Event database require a bit of preparation, which we describe here, in order to have a reproducible analysis and make it possible to easily reload the data for make additional analysis.
The database is downloaded from the Reproducible Analysis Coursera course website. More details are available:
We load the necessary libraries:
library(plyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.3
We download locally the data and we load it in a data frame, then we decode the start date of the event.
# Get the data
setwd("~\\..\\Google Drive\\Data Science\\05 - Reproducible Research\\02 Peer")
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists("./data/storm.bz2")) {
Url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(Url, destfile = "./data/storm.bz2", mode = "wb", method = "auto")
}
# read the data on the storm variable
bz <- bzfile("./data/storm.bz2", "r") # bzip2-ed file
storm <- read.csv(bz)
close(bz)
The data frame is quite big: (397.7571 MB) so in case this analysis would be reproduced, we advise to process this data with a suitable computer (with at least 4GB RAM).
According to the database documentation (see here all kind of event type are recorded only starting in 1996, so it is sensible to restrict our analysis to the period where we could compare events. Moreover we restrict our working data frame only to the essential column, for speed and convenience. We also decode data and year and add these additional fields.
felder <- c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP")
storm.rec <- storm[as.Date(storm$BGN_DATE, "%m/%d/%Y") >= as.Date("19960101",
"%Y%m%d"), felder]
storm.rec$date <- as.Date(storm.rec$BGN_DATE, "%m/%d/%Y")
storm.rec$Year <- format(storm.rec$date, "%Y")
We obtain a much manageable object (40.8746 MB)
For the sake of this analysis we have to decode the damages, which are presented with a multiplier (K= thousand, M= millions, etc…). We also sum Property and Crop damage to obtain a Total Damage Value.
storm.rec$Pdmx <- 1
storm.rec[storm.rec$PROPDMGEXP %in% c("k", "K"), "Pdmx"] <- 1000
storm.rec[storm.rec$PROPDMGEXP %in% c("m", "M"), "Pdmx"] <- 1e+06
storm.rec[storm.rec$PROPDMGEXP %in% c("g", "G"), "Pdmx"] <- 1e+09
storm.rec$Cdmx <- 1
storm.rec[storm.rec$CROPDMGEXP %in% c("k", "K"), "Cdmx"] <- 1000
storm.rec[storm.rec$CROPDMGEXP %in% c("m", "M"), "Cdmx"] <- 1e+06
storm.rec[storm.rec$CROPDMGEXP %in% c("g", "G"), "Cdmx"] <- 1e+09
storm.rec$totalDamage <- storm.rec$PROPDMG * storm.rec$Pdmx + storm.rec$CROPDMG *
storm.rec$Cdmx
Now it is important to notice that there are 985 type of events (field EVCLASS), we need a less detailed classification. We just created a new field and classfy the events accordingly to their description. The classification could not be and need not to be perfect, as it is developed only for exposition purposes.
storm.rec$Class <- "other"
storm.rec[grep("tornado", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "tornado-related"
storm.rec[grep("wind|wnd", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "wind-related"
storm.rec[grep("heat|warm|drought", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "heat-related"
storm.rec[grep("fire", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "fire-related"
storm.rec[grep("hurricane|typhoon|tropical", ignore.case = T, storm.rec$EVTYPE),
"Class"] <- "hurricane-related"
storm.rec[grep("lightning", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "ligthning-related"
storm.rec[grep("flood|fld|tsunami|tide|surge|surf|waterspout", ignore.case = T,
storm.rec$EVTYPE), "Class"] <- "flood-related"
storm.rec[grep("rain|precip", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "rain-related"
storm.rec[grep("rip", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "ripCurr-related"
storm.rec[grep("avalanche|landslide", ignore.case = T, storm.rec$EVTYPE), "Class"] <- "avalanche-related"
storm.rec[grep("snow|blizzard|ice|glaze|winter|cold|wintry|drizzle|cool|hail|fog|vog|frost|freeze|chill",
ignore.case = T, storm.rec$EVTYPE), "Class"] <- "cold-related"
Missing values are a common problem with environmental data and so we check to se what proportion of the observations are missing (i.e. coded as NA). There are no missing values.
sum(!complete.cases(storm.rec))
## [1] 0
We also prepare some other summarized data structure, to simplify analysis. We summarize all the numerical quantity (Fatalities, Injuries, Damages) pro Class and pro Year.
# summarize to simplify plotting
events <- ddply(storm.rec, .(Class), summarise, TotFatal = sum(FATALITIES),
TotInjur = sum(INJURIES), TotDamage = sum(totalDamage))
events.year <- ddply(storm.rec, .(Class, Year), summarise, TotFatal = sum(FATALITIES),
TotInjur = sum(INJURIES), TotDamage = sum(totalDamage))
Now we are ready to analyze our data. The first and most important question is: what are the most important storm events, in terms of human and economic which have plagued the USA from 1996 to 2011? We then plot Fatalities, Injuries and Total damages.
par(mfrow = c(1, 3))
par(las = 2)
par(mar = c(7, 8, 2, 1))
# total fatalities
barplot(events[order(events$TotFatal), ]$TotFatal, names.arg = events[order(events$TotFatal),
]$Class, horiz = T, main = "Storm Fatalities", xlab = "Fatalities")
# total Injures
barplot(events[order(events$TotInjur), ]$TotInjur, names.arg = events[order(events$TotInjur),
]$Class, horiz = T, main = "Storm Injuries", xlab = "Injuries", axes = c(0))
axis(1, cex.axis = 0.9)
# total damage (in billions)
options(scipen = 999)
barplot(events[order(events$TotDamage), ]$TotDamage/1e+09, names.arg = events[order(events$TotDamage),
]$Class, horiz = T, main = "Storm Total Damages", xlab = "Damages (Billions $)")
"Fig. 1 - Analysis of the most harmful storm event"
As one can readily see from Figure 1, the human and economic cost is very significant. More in detail heat-, flood- and tornado-related event are the worst in terms of Fatalities and Injuries. Heat-related events are the deadliest, probably for the effects that these event have on the young and the elderly. Tornado-related event cause the greatest Injuries, the third greatest fatalities but are relatively light in term of damages.
Also relatively uncommon event, as rip current are responsible for many fatalities. In many cases small preventive measures could alleviate the human and economic toll caused by these events.
The other important question that we want to address is how much the situation changes in time, given also that the changing of climate could, in the opinion of some scientist, make storm event worse. We select the three most deadly events and plot them in time:
# now analyze the three biggest cause of fatalities
bd <- c("tornado-related", "flood-related", "heat-related")
bd.year <- events.year[events.year$Class %in% bd, ]
ggplot(bd.year, aes(x = Year, y = TotFatal, group = 1)) + geom_smooth(method = "loess") +
geom_point() + facet_grid(Class ~ .) + ggtitle("Fatalities caused by floods, tornadoes and heat related events") +
ylab("Total Fatalities")
"Fig. 2 - Time series analysis of the most harmful storm event"
As we could see from Figure 2, if we exclude the very deadly 2011 tornado season, the number of fatalities is relatively stable. Maybe the worsening of the climate is counterbalanced from measures taken at national and local level.
In the original database is also indicated where these events took place, so it is possible to repeat these analysis at the local level. It is only necessary to add filter at the geography fields (see the database documentation referenced above).