Synopsis

Every year, severe weather events affect the health of the US population and the US economy. To help identify where government preparations should focus, this study asks which types of severe weather event have the greatest impact on population health and on the economy. It uses data collected by the US National Oceanic and Atmospheric Administration on severe weather events in the US from Jan 1996 to Nov 2011. The analysis shows which weather events over this period caused the most fatalities, the most injuries and the greatest cost of damage to property and crops. Excessive heat events caused the greatest number of deaths, whereas tornadoes caused the greatest number of injuries; overall, tornadoes had the biggest impact on the population in terms of fatalities and injuries combined. Finally, flood events cost the US economy the most in terms of property and crop damage combined.

Data Processing

Data is downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and prepared for analysis.

setwd("~/Coursera/Reproducible Research/Assignment 2")
##Checks to see if the file already exists, otherwise,
##downloads the file from fileUrl
##(the same path is used for the check, the download and the read)
dfile <- "./stormdata.csv.bz2"
if(!file.exists(dfile)){
        fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl, destfile=dfile)
        dateDownloaded <- date()
}

##read file (read.csv reads the bz2-compressed file directly)
rawdata <- read.csv(dfile)

Only the variables required for this analysis are kept:
- BGN_DATE: the date and time the weather event began
- EVTYPE: the type of weather event, one of 48 types recorded
- FATALITIES: the number of deaths caused by the weather event
- INJURIES: the number of injuries caused by the weather event
- PROPDMG: estimate of the cost in dollars of the damage to property caused by the weather event
- PROPDMGEXP: the multiplier applied to PROPDMG, either k (thousands), m (millions) or b (billions) of dollars
- CROPDMG: estimate of the cost in dollars of the damage to crops caused by the weather event
- CROPDMGEXP: the multiplier applied to CROPDMG, either k (thousands), m (millions) or b (billions) of dollars

subsetdata <- subset(rawdata, select = c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

The date variable is converted into date format. (NB: this requires the lubridate package to be installed.) The data is then subset again to create the final dataset, covering Jan 1996 to Nov 2011. This is because data collected prior to 1996 covered only 3 types of weather event, whereas from 1996 onwards data was collected on 48 event types. See reference for further details: http://www.ncdc.noaa.gov/stormevents/details.jsp

##Convert dates to date variable
subsetdata$BGN_DATE <- as.character(subsetdata$BGN_DATE)
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.1.1
subsetdata$BGN_DATE <- mdy_hms(subsetdata$BGN_DATE)
##subset dates required
fdata <- subset(subsetdata, BGN_DATE > (strptime("1996-01-01","%Y-%m-%d")))
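
As a side note on the date filter above: whether events stamped exactly at midnight on 1 Jan 1996 are kept depends on the comparison operator and on the time zone used for the cut-off. The following is a minimal base R sketch, not the code used in this analysis, that makes both choices explicit. The toy date strings, the `%m/%d/%Y %H:%M:%S` format and the names `bgn`, `parsed`, `cutoff` and `keep` are assumptions made for illustration.

```r
# Toy stand-in for BGN_DATE strings (format assumed: month/day/year h:m:s)
bgn <- c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "11/30/2011 0:00:00")

# Parse in a fixed time zone so the cut-off does not depend on the local machine
parsed <- as.POSIXct(bgn, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Inclusive lower bound: events that began on 1 Jan 1996 are kept
cutoff <- as.POSIXct("1996-01-01", tz = "UTC")
keep <- parsed >= cutoff
keep
```

With `>=` the second toy record (1 Jan 1996) is retained; with a strict `>` it would be dropped.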

Finally, data is converted to appropriate formats, ready for analysis.

##Prepare data for analysis
fdata$year <- factor(year(fdata$BGN_DATE))
fdata$EVTYPE <- factor(tolower(fdata$EVTYPE))
fdata$PROPDMGEXP <- factor(tolower(fdata$PROPDMGEXP))
fdata$CROPDMGEXP <- factor(tolower(fdata$CROPDMGEXP))

Results

A. Weather events most harmful to US population’s health

Harm to the population’s health is measured in terms of the fatalities and injuries that weather events cause. The analysis first looks at the impact of weather events on fatalities and injuries separately, and then at their impact on fatalities and injuries combined.

The top five weather events by number of injuries are shown below, with the greatest number of injuries caused by tornadoes:

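##subset data where either fatalities or injuries occurred
##(assumed step: fdatahealth is not defined anywhere in the source)
fdatahealth <- subset(fdata, FATALITIES > 0 | INJURIES > 0)
##calculate sum of all injuries by event
suminjuries <- aggregate(fdatahealth$INJURIES, by=list(event = fdatahealth$EVTYPE), FUN=sum, na.rm=TRUE)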
suminjuries <- suminjuries[order(suminjuries$x, decreasing = TRUE),]
##select top 5 events causing largest number of injuries
top5suminjury <- head(suminjuries,5)
top5suminjury
##              event     x
## 102        tornado 20667
## 30           flood  6758
## 22  excessive heat  6391
## 68       lightning  4140
## 105      tstm wind  3629

However, the top five weather events by number of fatalities are slightly different, with the greatest number of fatalities caused by excessive heat events:

sumdeaths <- aggregate(fdatahealth$FATALITIES, by=list(event = fdatahealth$EVTYPE), FUN=sum, na.rm=TRUE)
sumdeaths <- sumdeaths[order(sumdeaths$x, decreasing = TRUE),]
##select top 5 events causing largest number of deaths
top5sumdeath <- head(sumdeaths,5)
top5sumdeath
##              event    x
## 22  excessive heat 1797
## 102        tornado 1511
## 29     flash flood  887
## 68       lightning  650
## 30           flood  414

To understand which weather events have the greatest overall impact on the US population’s health, whether by injury or death, injuries and fatalities are totalled. The top five weather events by fatalities and injuries combined are shown below, with the greatest impact caused by tornadoes:

##calculate sum of all deaths and injuries
mergedimpact <- merge(sumdeaths,suminjuries, by.x="event", by.y="event")
colnames(mergedimpact) <- c("event", "sumdeaths", "suminjuries")
mergedimpact$totalimpact <- rowSums(mergedimpact[,c("sumdeaths","suminjuries")])
mergedimpact <- mergedimpact[order(mergedimpact$totalimpact, decreasing = TRUE),]
##select top 5 events causing largest number of deaths and injuries
top5sumimpact <- head(mergedimpact, 5)
top5sumimpact
##              event sumdeaths suminjuries totalimpact
## 102        tornado      1511       20667       22178
## 22  excessive heat      1797        6391        8188
## 30           flood       414        6758        7172
## 68       lightning       650        4140        4790
## 105      tstm wind       241        3629        3870
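
The separate aggregate and merge steps above could also be collapsed into one call. The sketch below is an alternative formulation rather than the code that produced the tables; the data frame `toy` and its values are invented purely for illustration. Note that the formula interface of `aggregate` drops rows with missing values by default.

```r
# Toy data standing in for fdatahealth (values are illustrative only)
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "flood", "excessive heat"),
  FATALITIES = c(2, 1, 0, 5),
  INJURIES   = c(10, 4, 3, 1)
)

# Sum both outcome columns per event type in a single call
impact <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
impact$totalimpact <- impact$FATALITIES + impact$INJURIES

# Rank events by combined impact, largest first
impact <- impact[order(impact$totalimpact, decreasing = TRUE), ]
impact
```

This keeps fatalities, injuries and their total in one table, so no merge or column renaming is needed.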

This plot summarises the findings, showing the events causing the most injuries, the most deaths, and the most injuries and deaths combined. Note that the ggplot2 library is required.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.1
##plot top 5 events causing largest number of injuries
top5suminjury$event <-factor(top5suminjury$event, levels=top5suminjury[order(top5suminjury$x, decreasing = TRUE), "event"])
suminjuriesplot <- ggplot(top5suminjury, aes(x=event, y=x)) + geom_bar(stat="identity") + 
        labs(title="Total # of Injuries by event", x="event", y="# of injuries") +
        coord_cartesian(ylim = c(0, 25000)) +
        theme(axis.text.x=element_text(size=8, angle=90, vjust=0.5))

##plot top 5 events causing largest number of deaths
top5sumdeath$event <-factor(top5sumdeath$event, levels=top5sumdeath[order(top5sumdeath$x, decreasing = TRUE), "event"])
sumdeathsplot <- ggplot(top5sumdeath, aes(x=event, y=x)) + geom_bar(stat="identity") + 
        labs(title="Total # of Fatalities by event", x="event", y="# of fatalities") +
        coord_cartesian(ylim = c(0, 25000)) +
        theme(axis.text.x=element_text(size=8, angle=90, vjust=0.5))

##plot top 5 events causing largest number of deaths and injuries combined
top5sumimpact$event <-factor(top5sumimpact$event, levels=top5sumimpact[order(top5sumimpact$totalimpact, decreasing = TRUE), "event"])
sumimpactplot <- ggplot(top5sumimpact, aes(x=event, y=totalimpact)) + geom_bar(stat="identity") + 
        labs(title="Total # of Injuries & Fatalities by event", x="event", y="#of fatalities + injuries") +
        coord_cartesian(ylim = c(0, 25000)) +
        theme(axis.text.x=element_text(size=8, angle=90, vjust=0.5))

####### Multiple plot function##########
##Sourced from: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
####### End Multiple plot function##########

##plot multiple plot

multiplot(suminjuriesplot,sumimpactplot,sumdeathsplot, cols = 2)
## Loading required package: grid

B. Weather events that have the greatest economic consequences

The top five weather events by cost of property damage in dollars are shown below, with floods causing the greatest damage costs:

##subset data where either property damage or crop damage occurred
fdataecon <- subset(fdata, PROPDMG > 0 | CROPDMG > 0)
##set up new numeric variable that holds the total cost of property damage
##(NA until filled in; rows without a k, m or b multiplier stay NA)
fdataecon$PROPDMGCOST <- NA_real_
##calculates the total cost of the property damage for a given event
##m = million $'s, k = thousand $'s, b = billion $'s
for (i in seq_len(nrow(fdataecon))){
        if (fdataecon$PROPDMGEXP[i] == "m") {
                fdataecon$PROPDMGCOST[i] <- fdataecon$PROPDMG[i] * 1000000
        }
        else if (fdataecon$PROPDMGEXP[i] == "k") {
                fdataecon$PROPDMGCOST[i] <- fdataecon$PROPDMG[i] * 1000
        }
        else if (fdataecon$PROPDMGEXP[i] == "b") {
                fdataecon$PROPDMGCOST[i] <- fdataecon$PROPDMG[i] * 1000000000
        }
}
##calculate the total costs of property damage caused by event
sumpropdmg <- aggregate(fdataecon$PROPDMGCOST, by=list(event = fdataecon$EVTYPE), FUN=sum, na.rm=TRUE)
sumpropdmg <- sumpropdmg[order(sumpropdmg$x, decreasing = TRUE),]
top5propdmg <- head(sumpropdmg,5)
top5propdmg
##                 event         x
## 40              flood 1.440e+11
## 73  hurricane/typhoon 6.931e+10
## 115       storm surge 4.319e+10
## 121           tornado 2.469e+10
## 57               hail 1.536e+10
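
The per-row loop above could also be written as a vectorised lookup, which is usually much faster on a dataset of this size. The sketch below is an alternative under stated assumptions, not the code that produced the table: the data frame `toy` and the lookup vector `multiplier` are hypothetical names introduced for illustration.

```r
# Toy data standing in for fdataecon (values are illustrative only)
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1, 7),
  PROPDMGEXP = c("k", "m", "b", "")
)

# Named lookup vector: exponent code -> dollar multiplier
multiplier <- c(k = 1e3, m = 1e6, b = 1e9)

# Index the lookup by the exponent code; unknown or blank codes yield NA
toy$PROPDMGCOST <- toy$PROPDMG * unname(multiplier[as.character(toy$PROPDMGEXP)])
toy$PROPDMGCOST
```

Rows with a blank or unrecognised exponent come out as NA, which the `na.rm=TRUE` argument in the later `aggregate` call would then ignore. The same pattern would apply to CROPDMG and CROPDMGEXP.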

The top five weather events by cost of damage to crops in dollars are shown below, with drought causing the greatest damage costs:

##set up new numeric variable that holds the total cost of crop damage
##(NA until filled in; rows without a k, m or b multiplier stay NA)
fdataecon$CROPDMGCOST <- NA_real_
##calculates the total cost of the crop damage for a given event
##m = million $'s, k = thousand $'s, b = billion $'s
for (i in seq_len(nrow(fdataecon))){
        if (fdataecon$CROPDMGEXP[i] == "m") {
                fdataecon$CROPDMGCOST[i] <- fdataecon$CROPDMG[i] * 1000000
        }
        else if (fdataecon$CROPDMGEXP[i] == "k") {
                fdataecon$CROPDMGCOST[i] <- fdataecon$CROPDMG[i] * 1000
        }
        else if (fdataecon$CROPDMGEXP[i] == "b") {
                fdataecon$CROPDMGCOST[i] <- fdataecon$CROPDMG[i] * 1000000000
        }
}
##calculate the total costs of damage to crop caused by event
sumcropdmg <- aggregate(fdataecon$CROPDMGCOST, by=list(event = fdataecon$EVTYPE), FUN=sum, na.rm=TRUE)
sumcropdmg <- sumcropdmg[order(sumcropdmg$x, decreasing = TRUE),]
top5cropdmg <- head(sumcropdmg,5)
top5cropdmg
##         event         x
## 26    drought 1.337e+10
## 40      flood 5.100e+09
## 57       hail 2.829e+09
## 72  hurricane 2.744e+09
## 124 tstm wind 2.676e+09

The top five weather events by combined cost of damage to property and crops in dollars are shown below, with floods causing the greatest damage costs:

##merges property and crop dmg
mergedcosts <- merge(sumcropdmg,sumpropdmg,by.x="event", by.y="event")
colnames(mergedcosts) <- c("event", "sumcropdmg", "sumpropdmg")
mergedcosts$totaldmg <- rowSums(mergedcosts[,c("sumcropdmg","sumpropdmg")])
mergedcosts <- mergedcosts[order(mergedcosts$totaldmg, decreasing = TRUE),]
top5costs <- head(mergedcosts,5)
top5costs
##                 event sumcropdmg sumpropdmg  totaldmg
## 40              flood  5.100e+09  1.440e+11 1.491e+11
## 73  hurricane/typhoon  2.609e+09  6.931e+10 7.192e+10
## 115       storm surge  6.047e+06  4.319e+10 4.320e+10
## 121           tornado  5.222e+08  2.469e+10 2.521e+10
## 57               hail  2.829e+09  1.536e+10 1.819e+10

This plot shows the total cost of damage to both crops and property, by weather event.

##plot merged costs
top5costs$event <-factor(top5costs$event, levels=top5costs[order(top5costs$totaldmg, decreasing = TRUE), "event"])
plotcosts <- ggplot(top5costs, aes(x=event, y=totaldmg)) + geom_bar(stat="identity") + 
        labs(title="Total Cost of damage by weather event",x="event", y="total cost of damage($)")
plotcosts

Future Analysis

There are two areas that a future analysis should address.
Firstly, the data recorded in EVTYPE, the weather event type, should be a factor variable with 48 levels. However, the data was collected using a free text box, so a number of records are recorded incorrectly. Future analysis should clean the values in this field.
Secondly, due to time constraints, the visual layout and format of the plots were not refined; a future analysis should improve them.
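
As a starting point for the first item, a clean-up of EVTYPE might look like the sketch below. It is only an illustration: the example labels are invented, and `lookup` is a hypothetical, deliberately tiny mapping table that would have to be built out by hand against the official list of 48 event types. The one mapping shown relies on "tstm" being the abbreviation for thunderstorm.

```r
# Example free-text labels (illustrative, not drawn from the dataset)
raw <- c(" TSTM WIND", "thunderstorm  wind", "Flood ", "tstm wind")

# Normalise case and whitespace
clean <- tolower(trimws(raw))
clean <- gsub("\\s+", " ", clean)

# Hypothetical lookup of known variants -> official label
lookup <- c("tstm wind" = "thunderstorm wind")
hit <- clean %in% names(lookup)
clean[hit] <- unname(lookup[clean[hit]])

table(clean)
```

After this step, the three thunderstorm wind variants collapse into a single level, so their fatalities, injuries and costs would be counted together rather than split across labels.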