Synopsis

In the present data analysis we attempt to answer following questions:

From the course assignment web site we obtain data tracking charactertistics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage from 1950 to November 2011. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

Data Processing

library(data.table)

stormdata <- read.csv("StormData.csv", stringsAsFactors=FALSE)

After reading in the data, check the dataset.

dim(stormdata)
## [1] 902297     37

Cleaning the data and getting only the one related to our study.

dt <- stormdata[ , c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

dt$BGN_DATE <- as.Date(dt$BGN_DATE, "%m/%d/%Y %H:%M:%S")

str(dt)
## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_DATE  : Date, format: "1950-04-18" "1950-04-18" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...

The variables PROPDMGEXP and CROPDMGEXP have the factor of multiplicity of the variables PROPDMG and CROPDMG with the values:

unique(dt$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(dt$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

We create a new variables called PROPTOTALDMG and CROPTOTALDMG and TOTALDMG with the sum of PROPTOTALDMG and CROPTOTALDMG. We’ll use plyr library:

library(plyr)

tempPROPDMG <- mapvalues(dt$PROPDMGEXP,
c("K","M","", "B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"), 
c(1e3,1e6, 1, 1e9,1e6,  1,  1,1e5,1e6,  1,1e4,1e2,1e3,  1,1e7,1e2,  1, 10,1e8))

tempCROPDMG <- mapvalues(dt$CROPDMGEXP,
c("","M","K","m","B","?","0","k","2"),
c( 1,1e6,1e3,1e6,1e9,1,1,1e3,1e2))

dt$PROPTOTALDMG <- as.numeric(tempPROPDMG) * dt$PROPDMG
dt$CROPTOTALDMG <- as.numeric(tempCROPDMG) * dt$CROPDMG

dt$TOTALDMG <- dt$PROPTOTALDMG + dt$CROPTOTALDMG

head(dt$TOTALDMG)
## [1] 25000  2500 25000  2500  2500  2500

Data analysis

Most harmful types of events with respect to population health across the United States

There are two measurements in the dataset can reflect the degree of harmfulness of a type of event with respect to population health: fatalities and injuries. Thus we sum them up over types of events to find out the most harmful type of event. Firstly, we summarize data about fatalities and injuries by type of event. And we create a total data frame.

total <- ddply(dt, .(EVTYPE), summarize, propdamage = sum(TOTALDMG), injuries= sum(INJURIES), fatalities = sum(FATALITIES), persdamage = sum(INJURIES)+sum(FATALITIES))

Graph of total of personal damage, between injuries and fatalities.

res <- total[order(total$persdamage, decreasing = TRUE),]
head(res,10)
##                EVTYPE   propdamage injuries fatalities persdamage
## 834           TORNADO  57362333947    91346       5633      96979
## 130    EXCESSIVE HEAT    500155700     6525       1903       8428
## 856         TSTM WIND   5038935845     6957        504       7461
## 170             FLOOD 150319678257     6789        470       7259
## 464         LIGHTNING    942471520     5230        816       6046
## 275              HEAT    403258500     2100        937       3037
## 153       FLASH FLOOD  18243991079     1777        978       2755
## 427         ICE STORM   8967041360     1975         89       2064
## 760 THUNDERSTORM WIND   3897965522     1488        133       1621
## 972      WINTER STORM   6715441251     1321        206       1527
library(ggplot2)

p1 <- ggplot(res[1:10,], aes(x=EVTYPE, y=persdamage, fill = EVTYPE)) 
p1 <- p1 + geom_bar(stat="identity", fill='red', color='black') 
p1 <- p1 + theme(axis.text.x = element_text(angle=90, hjust=1))
p1 <- p1 + labs(x = "", y = "personal damage (injuries and fatalities)")

plot(p1)

Dual graph showing injuries and fatalities.

res <- total[order(total$injuries, decreasing = TRUE),]

p2 <- ggplot(res[1:10,], aes(x=EVTYPE, y=injuries, fill = EVTYPE)) 
p2 <- p2 + geom_bar(stat="identity", fill='red', color='black') 
p2 <- p2 + theme(axis.text.x = element_text(angle=90, hjust=1))
p2 <- p2 + labs(x = "", y = "personal damage (injuries)")

res <- total[order(total$fatalities, decreasing = TRUE),]

p3 <- ggplot(res[1:10,], aes(x=EVTYPE, y=fatalities, fill = EVTYPE)) 
p3 <- p3 + geom_bar(stat="identity", fill='red', color='black') 
p3 <- p3 + theme(axis.text.x = element_text(angle=90, hjust=1))
p3 <- p3 + labs(x = "", y = "personal damage (fatalities)")

The easy way is to plot two plots use the multiplot function, defined at www.cookbook-r.com

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

multiplot(p2, p3, cols=1)

Types of events have the greatest economic consequences across the United States

Likewise, there are measurements judging economic losses, property damages and crop damages. Here we can add these two measurements up over differe event types to determine which one has the greatest economic consequences.

To begin with, we need to make sure all damages are with the same unit, i.e. million dollars, combining the information afforded by four variables propdmg, propdmgexp, cropdmg, cropdmgexp in the dataset. To perform this conversion, we assumed TOTALDMG variable and propdamage as a resulting value of property damage in $mln.

Graph of property damage in mln $USD.

res <- total[order(total$propdamage, decreasing = TRUE),]

p1 <- ggplot(res[1:10,], aes(x=EVTYPE, y=propdamage/10E6, fill = EVTYPE)) 
p1 <- p1 + geom_bar(stat="identity", fill='lightgreen', color='black') 
p1 <- p1 + theme(axis.text.x = element_text(angle=90, hjust=1))
p1 <- p1 + labs(x = "", y = "Property damage (mln $USD)")

plot(p1)

Results

This question is easy to answer if you see the graph. The TORNADOES are the most harmful according to population health with more than 90.000 people hurt or dead.

The FLOODS are the events with the greatest economic consequences, followed by HURRICANES/TYPHOONES.