In the present data analysis we attempt to answer following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
From the course assignment web site we obtain data tracking charactertistics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage from 1950 to November 2011. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
library(data.table)
stormdata <- read.csv("StormData.csv", stringsAsFactors=FALSE)
After reading in the data, check the dataset.
dim(stormdata)
## [1] 902297 37
Cleaning the data and getting only the one related to our study.
dt <- stormdata[ , c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
dt$BGN_DATE <- as.Date(dt$BGN_DATE, "%m/%d/%Y %H:%M:%S")
str(dt)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_DATE : Date, format: "1950-04-18" "1950-04-18" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
The variables PROPDMGEXP and CROPDMGEXP have the factor of multiplicity of the variables PROPDMG and CROPDMG with the values:
H -> 1^10e2
K -> 1^10e3
M -> 1^10e6
B -> 1^10e9
numeric -> 1^10e(numeric)
unique(dt$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(dt$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
We create a new variables called PROPTOTALDMG and CROPTOTALDMG and TOTALDMG with the sum of PROPTOTALDMG and CROPTOTALDMG. We’ll use plyr library:
library(plyr)
tempPROPDMG <- mapvalues(dt$PROPDMGEXP,
c("K","M","", "B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"),
c(1e3,1e6, 1, 1e9,1e6, 1, 1,1e5,1e6, 1,1e4,1e2,1e3, 1,1e7,1e2, 1, 10,1e8))
tempCROPDMG <- mapvalues(dt$CROPDMGEXP,
c("","M","K","m","B","?","0","k","2"),
c( 1,1e6,1e3,1e6,1e9,1,1,1e3,1e2))
dt$PROPTOTALDMG <- as.numeric(tempPROPDMG) * dt$PROPDMG
dt$CROPTOTALDMG <- as.numeric(tempCROPDMG) * dt$CROPDMG
dt$TOTALDMG <- dt$PROPTOTALDMG + dt$CROPTOTALDMG
head(dt$TOTALDMG)
## [1] 25000 2500 25000 2500 2500 2500
There are two measurements in the dataset can reflect the degree of harmfulness of a type of event with respect to population health: fatalities and injuries. Thus we sum them up over types of events to find out the most harmful type of event. Firstly, we summarize data about fatalities and injuries by type of event. And we create a total data frame.
total <- ddply(dt, .(EVTYPE), summarize, propdamage = sum(TOTALDMG), injuries= sum(INJURIES), fatalities = sum(FATALITIES), persdamage = sum(INJURIES)+sum(FATALITIES))
Graph of total of personal damage, between injuries and fatalities.
res <- total[order(total$persdamage, decreasing = TRUE),]
head(res,10)
## EVTYPE propdamage injuries fatalities persdamage
## 834 TORNADO 57362333947 91346 5633 96979
## 130 EXCESSIVE HEAT 500155700 6525 1903 8428
## 856 TSTM WIND 5038935845 6957 504 7461
## 170 FLOOD 150319678257 6789 470 7259
## 464 LIGHTNING 942471520 5230 816 6046
## 275 HEAT 403258500 2100 937 3037
## 153 FLASH FLOOD 18243991079 1777 978 2755
## 427 ICE STORM 8967041360 1975 89 2064
## 760 THUNDERSTORM WIND 3897965522 1488 133 1621
## 972 WINTER STORM 6715441251 1321 206 1527
library(ggplot2)
p1 <- ggplot(res[1:10,], aes(x=EVTYPE, y=persdamage, fill = EVTYPE))
p1 <- p1 + geom_bar(stat="identity", fill='red', color='black')
p1 <- p1 + theme(axis.text.x = element_text(angle=90, hjust=1))
p1 <- p1 + labs(x = "", y = "personal damage (injuries and fatalities)")
plot(p1)
Dual graph showing injuries and fatalities.
res <- total[order(total$injuries, decreasing = TRUE),]
p2 <- ggplot(res[1:10,], aes(x=EVTYPE, y=injuries, fill = EVTYPE))
p2 <- p2 + geom_bar(stat="identity", fill='red', color='black')
p2 <- p2 + theme(axis.text.x = element_text(angle=90, hjust=1))
p2 <- p2 + labs(x = "", y = "personal damage (injuries)")
res <- total[order(total$fatalities, decreasing = TRUE),]
p3 <- ggplot(res[1:10,], aes(x=EVTYPE, y=fatalities, fill = EVTYPE))
p3 <- p3 + geom_bar(stat="identity", fill='red', color='black')
p3 <- p3 + theme(axis.text.x = element_text(angle=90, hjust=1))
p3 <- p3 + labs(x = "", y = "personal damage (fatalities)")
The easy way is to plot two plots use the multiplot function, defined at www.cookbook-r.com
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
multiplot(p2, p3, cols=1)
Likewise, there are measurements judging economic losses, property damages and crop damages. Here we can add these two measurements up over differe event types to determine which one has the greatest economic consequences.
To begin with, we need to make sure all damages are with the same unit, i.e. million dollars, combining the information afforded by four variables propdmg, propdmgexp, cropdmg, cropdmgexp in the dataset. To perform this conversion, we assumed TOTALDMG variable and propdamage as a resulting value of property damage in $mln.
Graph of property damage in mln $USD.
res <- total[order(total$propdamage, decreasing = TRUE),]
p1 <- ggplot(res[1:10,], aes(x=EVTYPE, y=propdamage/10E6, fill = EVTYPE))
p1 <- p1 + geom_bar(stat="identity", fill='lightgreen', color='black')
p1 <- p1 + theme(axis.text.x = element_text(angle=90, hjust=1))
p1 <- p1 + labs(x = "", y = "Property damage (mln $USD)")
plot(p1)
This question is easy to answer if you see the graph. The TORNADOES are the most harmful according to population health with more than 90.000 people hurt or dead.
The FLOODS are the events with the greatest economic consequences, followed by HURRICANES/TYPHOONES.