This study analyzes the NOAA data set on damage from environmental disasters to identify the types of natural disaster that cause the most damage to life and property. Based on data collected from 1991 to 2011, excessive heat (1903 deaths) and tornadoes (1699 deaths) killed the largest number of people in the USA among all the listed disaster types. Tornadoes (25497) and floods (6789) were responsible for the most injuries over this period. Tornadoes (USD 1519172.4), flash floods (USD 1420124.6) and thunderstorm winds (USD 1335965.6, recorded as TSTM WIND) caused the most property damage. Hail (USD 579596.28), flash floods (USD 179200.46) and floods (USD 168037.88) were the main causes of crop damage.
We start the analysis by loading the data set into the R workspace with the read.csv command, which holds the loaded data in a data frame. The BGN_DATE column, which records the date of each event, is converted to the Date class. Events dated on or before 1 January 1991 are then filtered out, since we are concerned only with recent data.
## read the source data into a data frame
noaa <- read.csv("data/StormData.csv", header = TRUE)
noaa <- as.data.frame(noaa)
## convert the BGN_DATE column to the Date class
noaa1 <- transform(noaa, BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y"))
## keep events dated after 1991-01-01 (the strict ">" also drops that day itself)
noaa <- noaa1[noaa1$BGN_DATE > as.Date("1991-01-01"), ]
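As a small illustration of the date handling above, the following sketch runs the same conversion and filter on a toy data frame. The three date strings (including their trailing time stamp) and the names toy, cutoff, strict and inclusive are assumptions made for this example, not values taken from the NOAA file. It shows that the strict ">" comparison drops events dated 1 January 1991, whereas ">=" would keep them.

```r
# Toy stand-in for the BGN_DATE column; all values are hypothetical.
toy <- data.frame(
  BGN_DATE = c("12/31/1990 0:00:00", "1/1/1991 0:00:00", "1/2/1991 0:00:00"),
  FATALITIES = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# as.Date() parses the leading month/day/year and ignores the trailing time text.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

cutoff <- as.Date("1991-01-01")
strict <- toy[toy$BGN_DATE > cutoff, ]      # same comparison as the report
inclusive <- toy[toy$BGN_DATE >= cutoff, ]  # would also keep 1 January 1991

nrow(strict)
nrow(inclusive)
```

Which comparison is appropriate depends on whether events on the cutoff day itself should count as recent data.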
In the first half of the analysis, we measure the impact on population health by the number of people killed and the number of people injured. For this purpose, we subset the data set to the relevant columns: event type, number of deaths and number of injuries.
noaa_subset <- noaa[,c("EVTYPE","FATALITIES","INJURIES")]
noaa_subset <- as.data.frame(noaa_subset)
noaa_subset <- transform(noaa_subset, EVTYPE = factor(EVTYPE))
We will use the plyr package to compute the total number of deaths for each disaster type. These totals are sorted in descending order and the top 10 causes of death are recorded.
library(plyr)
ev_fatal <- ddply(noaa_subset, .(EVTYPE), summarize, TOTAL_DEATHS = sum(FATALITIES))
ev_fatal <- ev_fatal[order(-ev_fatal$TOTAL_DEATHS),]
ev_fatal <- as.data.frame(ev_fatal)
## despite its name, fatal_top20 holds the top 10 rows
fatal_top20 <- head(ev_fatal, 10)
fatal_top20
##             EVTYPE TOTAL_DEATHS
## 130 EXCESSIVE HEAT         1903
## 834        TORNADO         1699
## 153    FLASH FLOOD          978
## 275           HEAT          937
## 464      LIGHTNING          816
## 170          FLOOD          470
## 585    RIP CURRENT          368
## 856      TSTM WIND          285
## 359      HIGH WIND          248
## 19       AVALANCHE          224
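The group, sum and sort pattern used above can also be sketched in base R, which may help readers who do not have plyr installed. This is an alternative formulation rather than the method used in this report; the toy data frame and the names totals and top2 are hypothetical.

```r
# Hypothetical event records standing in for noaa_subset.
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 5, 2, 1, 4)
)

# Sum FATALITIES within each EVTYPE (base R counterpart of the ddply call).
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
names(totals)[2] <- "TOTAL_DEATHS"

# Sort in descending order of total deaths and keep the leading rows.
totals <- totals[order(-totals$TOTAL_DEATHS), ]
top2 <- head(totals, 2)
top2
```

On the full data set, the head() call would use 10 instead of 2 to match the tables in this report.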
Next, we identify which natural disasters cause the most injuries, using the same approach as before.
ev_injuries <- ddply(noaa_subset, .(EVTYPE), summarize, TOTAL_INJURIES = sum(INJURIES))
ev_injuries <- ev_injuries[order(-ev_injuries$TOTAL_INJURIES),]
ev_injuries <- as.data.frame(ev_injuries)
## despite its name, injuries_top20 holds the top 10 rows
injuries_top20 <- head(ev_injuries, 10)
injuries_top20
##                EVTYPE TOTAL_INJURIES
## 834           TORNADO          25497
## 170             FLOOD           6789
## 130    EXCESSIVE HEAT           6525
## 464         LIGHTNING           5230
## 856         TSTM WIND           4441
## 275              HEAT           2100
## 427         ICE STORM           1975
## 153       FLASH FLOOD           1777
## 760 THUNDERSTORM WIND           1488
## 972      WINTER STORM           1321
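The injuries table lists TSTM WIND and THUNDERSTORM WIND as separate event types, which suggests that the same hazard may be recorded under more than one label. The sketch below shows one way such labels could be merged before aggregating. The normalize_evtype helper and its two substitution rules are hypothetical and are not part of this analysis, whose totals keep the labels exactly as recorded.

```r
# Hypothetical helper: map a few spelling variants onto one label.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                                  # unify case and stray spaces
  x <- sub("^TSTM", "THUNDERSTORM", x)                     # expand the TSTM abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x) # fold plural into singular
  x
}

labels <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", " tornado ")
normalized <- normalize_evtype(labels)
normalized
```

Whether labels such as HEAT and EXCESSIVE HEAT should also be merged is a judgment call about the event definitions, so this sketch leaves them untouched.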
We will use the multiplot function from Cookbook for R to arrange the two plots of total deaths and total injuries by natural disaster in a single figure.
# Multiple plot function
# Reference: Cookbook for R
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
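To make the layout logic inside multiplot concrete, the following sketch reproduces only the matrix construction and the position lookup in base R, without drawing anything. The make_layout name is a hypothetical helper introduced for this illustration.

```r
# Hypothetical helper mirroring the default layout computed inside multiplot().
make_layout <- function(numPlots, cols) {
  rows <- ceiling(numPlots / cols)
  matrix(seq(1, cols * rows), ncol = cols, nrow = rows)
}

# Two plots in one column, as in multiplot(q1, q2, cols = 1): plot 1 above plot 2.
layout_2x1 <- make_layout(2, 1)

# Three plots in two columns: the matrix is filled column by column.
layout_3 <- make_layout(3, 2)

# Same lookup multiplot() uses to find the grid cell of plot number 3.
pos <- which(layout_3 == 3, arr.ind = TRUE)
pos
```

Because matrix() fills by column, plots are placed down the first column before the second one is used; passing an explicit layout matrix changes that order.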
We will use the ggplot2 library to make the plots. The plotting code and the resulting plots are presented in the results section of this report.
This completes the first half of our analysis. In the second half, we identify which natural disasters cause the most damage to property and crops. Following the same approach as before, we first subset the NOAA data set to the columns that record property and crop damage.
noaa_subset_prop <- noaa[,c("EVTYPE","PROPDMG","CROPDMG")]
noaa_subset_prop <- as.data.frame(noaa_subset_prop)
noaa_subset_prop <- transform(noaa_subset_prop, EVTYPE = factor(EVTYPE))
As before, we will use the plyr package to compute the total property and crop damage for each event type. Each data frame is ordered by decreasing total damage, and the top 10 entries are kept for plotting.
##TOP DISASTERS BY PROPERTY DAMAGE
ev_prop <- ddply(noaa_subset_prop, .(EVTYPE), summarize, TOTAL_PROP = sum(PROPDMG))
ev_prop <- ev_prop[order(-ev_prop$TOTAL_PROP),]
ev_prop <- as.data.frame(ev_prop)
prop_top10 <- head(ev_prop, 10)
prop_top10
##                 EVTYPE TOTAL_PROP
## 834            TORNADO  1519172.4
## 153        FLASH FLOOD  1420124.6
## 856          TSTM WIND  1335965.6
## 170              FLOOD   899938.5
## 760  THUNDERSTORM WIND   876844.2
## 244               HAIL   688693.4
## 464          LIGHTNING   603351.8
## 786 THUNDERSTORM WINDS   446293.2
## 359          HIGH WIND   324731.6
## 972       WINTER STORM   132720.6
##TOP DISASTERS BY CROP DAMAGE
ev_crop <- ddply(noaa_subset_prop, .(EVTYPE), summarize, TOTAL_CROP = sum(CROPDMG))
ev_crop <- ev_crop[order(-ev_crop$TOTAL_CROP),]
ev_crop <- as.data.frame(ev_crop)
crop_top10 <- head(ev_crop, 10)
crop_top10
##                 EVTYPE TOTAL_CROP
## 244               HAIL  579596.28
## 153        FLASH FLOOD  179200.46
## 170              FLOOD  168037.88
## 856          TSTM WIND  109202.60
## 834            TORNADO  100018.52
## 760  THUNDERSTORM WIND   66791.45
## 95             DROUGHT   33898.62
## 786 THUNDERSTORM WINDS   18684.93
## 359          HIGH WIND   17283.21
## 290         HEAVY RAIN   11122.80
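The totals above sum the PROPDMG and CROPDMG columns as recorded. The NOAA storm data also carries magnitude columns (PROPDMGEXP and CROPDMGEXP, with codes such as K, M and B for thousands, millions and billions) that this analysis does not use. A minimal sketch of how such a code could be applied before summing is given below, on a toy data frame. The values, the multiplier lookup and the choice to treat a blank or unrecognized code as a multiplier of 1 are assumptions made for this example.

```r
# Hypothetical records; PROPDMGEXP holds the magnitude code for PROPDMG.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  stringsAsFactors = FALSE
)

# Assumed mapping from magnitude code to multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; blank or unrecognized codes yield NA and fall back to 1.
scale <- unname(multiplier[toupper(toy$PROPDMGEXP)])
scale[is.na(scale)] <- 1

toy$PROP_USD <- toy$PROPDMG * scale
toy
```

Applying a scaling of this kind before the ddply step could change the ranking of event types, so it would need to be checked against the data set's documentation first.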
The plotting is done using the ggplot2 library, as in the previous case.
The first two plots below show the total deaths and injuries caused by the top 10 natural disasters in the USA from 1991 to 2011. The next two plots show the total damage to property and crops caused by the top 10 natural disasters over the same period.
library(ggplot2)
q1 <- qplot(TOTAL_DEATHS, EVTYPE, data=fatal_top20, main="TOP 10 NATURAL DISASTERS BY FATALITIES", xlab="TOTAL FATALITIES",ylab="DISASTER TYPE")
q2 <- qplot(log10(TOTAL_INJURIES), EVTYPE, data=injuries_top20, main="TOP 10 NATURAL DISASTERS BY INJURIES", xlab="TOTAL INJURIES LOG SCALE",ylab="DISASTER TYPE")
multiplot(q1, q2, cols=1)
q3 <- qplot(log10(TOTAL_PROP), EVTYPE, data=prop_top10, main="TOP 10 NATURAL DISASTERS BY PROPERTY DAMAGE", xlab="TOTAL PROPERTY DAMAGE LOG SCALE",ylab="DISASTER TYPE")
q4 <- qplot(log10(TOTAL_CROP), EVTYPE, data=crop_top10, main="TOP 10 NATURAL DISASTERS BY CROP DAMAGE", xlab="TOTAL CROP DAMAGE LOG SCALE",ylab="DISASTER TYPE")
multiplot(q3, q4, cols=1)
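In the plots above, the disaster types appear on the vertical axis in factor-level order, which is alphabetical by default for EVTYPE. One possible refinement, not used in this report, is to reorder the factor levels by the plotted total so that the axis is sorted by magnitude. The sketch below shows the reordering step only, using the first three rows of the fatalities table; the data frame name top is hypothetical.

```r
# First three rows of the fatalities table from this report.
top <- data.frame(
  EVTYPE = factor(c("EXCESSIVE HEAT", "TORNADO", "FLASH FLOOD")),
  TOTAL_DEATHS = c(1903, 1699, 978)
)

# Default level order is alphabetical.
default_levels <- levels(top$EVTYPE)

# reorder() from the stats package sorts levels by increasing TOTAL_DEATHS.
top$EVTYPE <- reorder(top$EVTYPE, top$TOTAL_DEATHS)
sorted_levels <- levels(top$EVTYPE)
sorted_levels
```

Passing a data frame reordered this way to the same qplot call would be expected to place the disaster type with the largest total at the top of the axis.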