This project report was completed as part of the Reproducible Research course in the Coursera Data Science specialization, taught by Prof. Roger Peng at Johns Hopkins University.

SYNOPSIS

This study analyzes the NOAA data set on damage from environmental disasters to identify the types of natural disaster that cause the most damage to life and property. Based on the data collected from 1991 to 2011, excessive heat (1903) and tornadoes (1699) killed the largest number of people in the USA among all the listed disaster types. Tornadoes (25497) and floods (6789) are responsible for the most injuries over this period. Tornadoes (USD 1519172.4), flash floods (USD 1420124.6) and thunderstorm winds (recorded as TSTM WIND, USD 1335965.6) caused the most property damage. Hail (USD 579596.28), flash floods (USD 179200.46) and floods (USD 168037.88) are the main causes of crop damage from natural disasters.

Figures in brackets indicate the measured impact values.

DATA PROCESSING

We start the analysis by loading the data set into the R workspace with the read.csv function, which returns a data frame. The BGN_DATE column, which records the date of the event, is converted to the Date format. Only events dated after 1 January 1991 are kept, since we are concerned only with the recent data.

##read the source data; read.csv already returns a data frame
noaa <- read.csv("data/StormData.csv", header=TRUE)

##convert the BGN_DATE column to the date format
noaa1 <- transform(noaa, BGN_DATE = as.Date(BGN_DATE,"%m/%d/%Y"))

##keep only events dated after 1991-01-01 (the strict > also drops 1 January 1991 itself)
noaa <- noaa1[noaa1$BGN_DATE > as.Date("1991-01-01"),]
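
The date conversion and filter can be checked on a few toy rows before running them on the full file. The sketch below is illustrative only: the rows are made up, and the "month/day/year time" layout of BGN_DATE is an assumption about how the CSV stores the date.

```r
# Toy rows (made up) in the assumed "month/day/year time" layout of BGN_DATE
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1991 0:00:00", "6/30/2005 0:00:00"),
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD"),
  stringsAsFactors = FALSE
)

# as.Date() parses the leading date and ignores the trailing time text
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Same strict comparison as in the report: later than 1 January 1991
recent <- toy[toy$BGN_DATE > as.Date("1991-01-01"), ]
recent
```

Only the 2005 row survives here. The row dated exactly 1 January 1991 is dropped because the comparison is strict; using >= instead would keep it.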

In the first half of the analysis, we want to analyze the impact on population health by the number of people killed and the number of people injured. For this purpose, we subset the data set to the relevant columns of event type, number of deaths and number of injuries.

noaa_subset <- noaa[,c("EVTYPE","FATALITIES","INJURIES")]
noaa_subset<- as.data.frame(noaa_subset)
noaa_subset <- transform(noaa_subset, EVTYPE = factor(EVTYPE))

We will use the plyr package to compute the total number of deaths for each disaster type. These totals are sorted in descending order and the top 10 causes of death are recorded.

library(plyr)
##total deaths per event type, sorted in descending order
ev_fatal <- ddply(noaa_subset, .(EVTYPE), summarize, TOTAL_DEATHS = sum(FATALITIES))
ev_fatal <- ev_fatal[order(-ev_fatal$TOTAL_DEATHS),]
##despite its name, fatal_top20 holds the top 10 rows
fatal_top20 <- head(ev_fatal, 10)
fatal_top20
##             EVTYPE TOTAL_DEATHS
## 130 EXCESSIVE HEAT         1903
## 834        TORNADO         1699
## 153    FLASH FLOOD          978
## 275           HEAT          937
## 464      LIGHTNING          816
## 170          FLOOD          470
## 585    RIP CURRENT          368
## 856      TSTM WIND          285
## 359      HIGH WIND          248
## 19       AVALANCHE          224
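
For readers without plyr, the same group-and-sum step can be sketched in base R with aggregate(). This is an alternative illustration on made-up toy data, not the code used to produce the table above.

```r
# Toy data (made up) with the same column names as noaa_subset
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 5, 2, 1, 4),
  stringsAsFactors = FALSE
)

# Sum FATALITIES within each EVTYPE
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
names(totals)[2] <- "TOTAL_DEATHS"

# Sort in descending order and keep the top 2 rows
totals <- totals[order(-totals$TOTAL_DEATHS), ]
top2 <- head(totals, 2)
top2
```

On the toy data, HEAT (9 deaths) ranks first and TORNADO (5 deaths) second.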

Next we will try to identify which natural disasters cause the maximum injuries to people using the same approach as before.

##total injuries per event type, sorted in descending order
ev_injuries <- ddply(noaa_subset, .(EVTYPE), summarize, TOTAL_INJURIES = sum(INJURIES))
ev_injuries <- ev_injuries[order(-ev_injuries$TOTAL_INJURIES),]
##despite its name, injuries_top20 holds the top 10 rows
injuries_top20 <- head(ev_injuries, 10)
injuries_top20
##                EVTYPE TOTAL_INJURIES
## 834           TORNADO          25497
## 170             FLOOD           6789
## 130    EXCESSIVE HEAT           6525
## 464         LIGHTNING           5230
## 856         TSTM WIND           4441
## 275              HEAT           2100
## 427         ICE STORM           1975
## 153       FLASH FLOOD           1777
## 760 THUNDERSTORM WIND           1488
## 972      WINTER STORM           1321
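
The table above lists TSTM WIND and THUNDERSTORM WIND as separate event types, so their totals are split across two rows. Below is a minimal sketch of how such labels could be merged before aggregating. The helper clean_evtype and its two rewrite rules are hypothetical, and deciding which labels really describe the same event is a judgement call that this report does not make.

```r
# Toy data (made up) with a few label variants
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS",
               "Tornado ", "TORNADO"),
  INJURIES = c(10, 20, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: normalize case and whitespace, then map known variants
clean_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- sub("^TSTM", "THUNDERSTORM", x)                      # expand abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)  # drop the plural
  x
}

toy$EVTYPE_CLEAN <- clean_evtype(toy$EVTYPE)
merged <- aggregate(INJURIES ~ EVTYPE_CLEAN, data = toy, FUN = sum)
merged
```

With these rules the five toy rows collapse into two event types, THUNDERSTORM WIND (35 injuries) and TORNADO (3 injuries).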

We will use the multiplot function from the Cookbook for R to draw a single figure containing the 2 plots of total deaths and total injuries by natural disaster.

# Multiple plot function
# Reference: Cookbook for R
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
# 
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

We will use the ggplot2 package to make the plots. The plotting code and the resulting figures are presented in the results section of this report.

This completes the first half of our analysis. In the second half, we identify which natural disasters cause the most damage to property and crops. We follow the same approach as before and first subset the NOAA data set to the columns that record property and crop damage.

noaa_subset_prop <- noaa[,c("EVTYPE","PROPDMG","CROPDMG")]
noaa_subset_prop <- as.data.frame(noaa_subset_prop)
noaa_subset_prop <- transform(noaa_subset_prop, EVTYPE = factor(EVTYPE))

As before, we will use the plyr package to calculate the total property and crop damage for each event type. We will order the data frame in decreasing order of total damage and plot the top 10 entries.

##TOP DISASTERS BY PROPERTY DAMAGE
ev_prop <- ddply(noaa_subset_prop, .(EVTYPE), summarize, TOTAL_PROP = sum(PROPDMG))
ev_prop <- ev_prop[order(-ev_prop$TOTAL_PROP),]
ev_prop <- as.data.frame(ev_prop)

prop_top10 <- head(ev_prop, 10)
prop_top10
##                 EVTYPE TOTAL_PROP
## 834            TORNADO  1519172.4
## 153        FLASH FLOOD  1420124.6
## 856          TSTM WIND  1335965.6
## 170              FLOOD   899938.5
## 760  THUNDERSTORM WIND   876844.2
## 244               HAIL   688693.4
## 464          LIGHTNING   603351.8
## 786 THUNDERSTORM WINDS   446293.2
## 359          HIGH WIND   324731.6
## 972       WINTER STORM   132720.6

##TOP DISASTERS BY CROP DAMAGE
ev_crop <- ddply(noaa_subset_prop, .(EVTYPE), summarize, TOTAL_CROP = sum(CROPDMG))
ev_crop <- ev_crop[order(-ev_crop$TOTAL_CROP),]
ev_crop <- as.data.frame(ev_crop)

crop_top10 <- head(ev_crop, 10)
crop_top10
##                 EVTYPE TOTAL_CROP
## 244               HAIL  579596.28
## 153        FLASH FLOOD  179200.46
## 170              FLOOD  168037.88
## 856          TSTM WIND  109202.60
## 834            TORNADO  100018.52
## 760  THUNDERSTORM WIND   66791.45
## 95             DROUGHT   33898.62
## 786 THUNDERSTORM WINDS   18684.93
## 359          HIGH WIND   17283.21
## 290         HEAVY RAIN   11122.80
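
The totals above sum the PROPDMG and CROPDMG columns as they are. The NOAA file also carries PROPDMGEXP and CROPDMGEXP columns whose codes (K, M, B) give the magnitude of each value, so a dollar total needs each value scaled first. The sketch below shows one way to do this on made-up toy rows. The helper scale_damage is hypothetical, and treating every other code as a multiplier of 1 is an assumption, not something this report establishes.

```r
# Toy rows (made up) with the same column names as the NOAA file
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  PROPDMG    = c(25, 1.5, 500, 2),
  PROPDMGEXP = c("K", "M", "", "B"),
  stringsAsFactors = FALSE
)

# Magnitude codes: thousands, millions, billions
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale a damage value by its magnitude code
scale_damage <- function(value, exp_code) {
  m <- unname(multiplier[toupper(exp_code)])
  m[is.na(m)] <- 1   # assumption: unknown or empty codes leave the value unscaled
  value * m
}

toy$PROP_USD <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
prop_totals <- aggregate(PROP_USD ~ EVTYPE, data = toy, FUN = sum)
prop_totals[order(-prop_totals$PROP_USD), ]
```

On the toy rows, FLOOD ranks first once the B and M codes are applied, even though HAIL has the largest raw PROPDMG value. The same helper could be applied to CROPDMG with CROPDMGEXP.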

The plotting is done with the ggplot2 package, as in the previous case.

RESULTS

The first 2 plots below show the total deaths and injuries caused by the top 10 natural disasters in the USA from 1991 to 2011. The next 2 plots show the total damage to property and crops by the top 10 natural disasters over the same period.

library(ggplot2)

q1 <- qplot(TOTAL_DEATHS, EVTYPE, data=fatal_top20, main="TOP 10 NATURAL DISASTERS BY FATALITIES", xlab="TOTAL FATALITIES",ylab="DISASTER TYPE")

q2 <- qplot(log10(TOTAL_INJURIES), EVTYPE,  data=injuries_top20, main="TOP 10 NATURAL DISASTERS BY INJURIES", xlab="TOTAL INJURIES LOG SCALE",ylab="DISASTER TYPE")

multiplot(q1, q2, cols=1)

q3 <- qplot(log10(TOTAL_PROP), EVTYPE,  data=prop_top10, main="TOP 10 NATURAL DISASTERS BY PROPERTY DAMAGE", xlab="TOTAL PROPERTY DAMAGE LOG SCALE",ylab="DISASTER TYPE")

q4 <- qplot(log10(TOTAL_CROP), EVTYPE, data=crop_top10, main="TOP 10 NATURAL DISASTERS BY CROP DAMAGE", xlab="TOTAL CROP DAMAGE LOG SCALE",ylab="DISASTER TYPE")

multiplot(q3, q4, cols=1)
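
The qplot calls above can also be written with the full ggplot() interface, which may be useful because qplot() is marked as deprecated in recent ggplot2 releases. The sketch below is an alternative, not the code that produced the figures. It uses the top three rows of the injuries table and two choices that are assumptions of this example: scale_x_log10() in place of log10() on the data, and reorder() to sort the event types by value.

```r
library(ggplot2)

# Top three rows of the injuries table from this report
top3 <- data.frame(
  EVTYPE         = c("TORNADO", "FLOOD", "EXCESSIVE HEAT"),
  TOTAL_INJURIES = c(25497, 6789, 6525),
  stringsAsFactors = FALSE
)

# Points on a log-scaled x axis, event types ordered by total injuries
p <- ggplot(top3, aes(x = TOTAL_INJURIES,
                      y = reorder(EVTYPE, TOTAL_INJURIES))) +
  geom_point() +
  scale_x_log10() +
  labs(title = "TOP NATURAL DISASTERS BY INJURIES",
       x = "TOTAL INJURIES (LOG SCALE)",
       y = "DISASTER TYPE")

print(p)
```

With scale_x_log10() the axis ticks show the original injury counts rather than their logarithms, which can make the figure easier to read.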