Reproducible Research Peer Assessment 2: Exploring U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database

Synopsis

This analysis explores the NOAA Storm Database and demonstartes the inpact of severe weather events on people and property. The database covers the years 1950 to November 2011. The earlier years of the database contain fewer events, most likely due to a lack of good records. Recent years appear more complete.

Data analysis must address the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Across the United States, which types of events have the greatest economic consequences?   

Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.

Data Processing

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

The following packages are required to manipulate data and generate charts:

library(ggplot2)
library(reshape2)
library(knitr)
library(grid)
library(gridExtra)
library(data.table)

Loading the data

load storm data into data frame.

fileurl<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filebz<-"repdata-data-StormData.csv.bz2"
filecsv<-"repdata-data-StormData.csv"

if ( !file.exists(filecsv) )  {
    if ( !file.exists(filebz) )  
        download.file(fileurl,destfile=filebz)
    con<-bzfile(filebz,open="r")
}  else
    con<-file(filecsv,open="r")
rawdata<-read.csv(con,stringsAsFactors=FALSE)
close(con)

dim(rawdata)
## [1] 902297     37
## rawdata <- read.csv("repdata_data_StormData.csv" ,header= TRUE, sep = ",", stringsAsFactors = F)
## rawdata = read.table(bzfile("repdata_data_StormData.csv.bz2", open = "rt"))  

Processing Data

Transforming Property Damage to a Dollar value and cleaning punctuations where numric is expected.

## trim the columns to the necessary only
data <- data.table(rawdata[, c(2, 8, 23, 24, 25, 26, 27, 28)])
str(data)
## Classes 'data.table' and 'data.frame':   902297 obs. of  8 variables:
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Covert to lower so to avoid inaccuracy in the data and simplify reclassification
data$EVTYPE <- tolower(data$EVTYPE)
# There are a lot of the terminology contains mixed types, that cross boudaries, and may be difficult to attribute to one or the other types, 
# Here we try to reclassify a lot of these mixed bags along higher definition
length(unique(data$EVTYPE))
## [1] 898
# This is very large number
data[grepl("(.*)surf(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)swell(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)sea(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)tide(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)wave(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("marine(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("seiche(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("beach(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("tide(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("coastal(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("c(.*)st(.*)l(.*)flood(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("thunders(.*)", data$EVTYPE)==T,2] <- "thunderstorm"
data[grepl("tstm(.*)", data$EVTYPE)==T,2] <- "thunderstorm"
data[grepl("hail(.*)", data$EVTYPE)==T,2] <- "hail"
data[grepl("(.*)wind(.*)", data$EVTYPE)==T,2] <- "wind"
data[grepl("(.*)rain(.*)", data$EVTYPE)==T,2] <- "rain"
data[grepl("(.*)snow(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("(.*)ice(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("(.*)freez(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("wind(.*)", data$EVTYPE)==T,2] <- "wind"
data[grepl("rain(.*)", data$EVTYPE)==T,2] <- "rain"
data[grepl("snow(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("ice(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("freez(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("(.*)floo(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("floo(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("(.*)snow(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("(.*)hur(.*)", data$EVTYPE)==T,2] <- "hurricane"
data[grepl("hur(.*)", data$EVTYPE)==T,2] <- "hurricane"
data[grepl("(.*)heat(.*)", data$EVTYPE)==T,2] <- "heat"
data[grepl("heat(.*)", data$EVTYPE)==T,2] <- "heat"
data[grepl("(.*)cold(.*)", data$EVTYPE)==T,2] <- "cold"
data[grepl("cold(.*)", data$EVTYPE)==T,2] <- "cold"
data[grepl("(.*)dry(.*)", data$EVTYPE)==T,2] <- "drought"
data[grepl("dry(.*)", data$EVTYPE)==T,2] <- "drought"
data[grepl("(.*)tropical(.*)", data$EVTYPE)==T,2] <- "tropical"
data[grepl("tropical(.*)", data$EVTYPE)==T,2] <- "tropical"
data[grepl("(.*)tornad(.*)", data$EVTYPE)==T,2] <- "tornado"
data[grepl("tornad(.*)", data$EVTYPE)==T,2] <- "tornado"
data[grepl("(.*)wint(.*)", data$EVTYPE)==T,2] <- "wintery"
data[grepl("wint(.*)", data$EVTYPE)==T,2] <- "wintery"
data[grepl("(.*)lide(.*)", data$EVTYPE)==T,2] <- "landslides"
data[grepl("(.*)igh temp(.*)", data$EVTYPE)==T,2] <- "heat"
data[grepl("(.*)ightn(.*)", data$EVTYPE)==T,2] <- "lightning"
data[grepl("(.*)ust(.*)", data$EVTYPE)==T,2] <- "dust"
data[grepl("(.*)wat(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("wet(.*)", data$EVTYPE)==T,2] <- "wintery"
data[grepl("(.*)ire(.*)", data$EVTYPE)==T,2] <- "fires"
data[grepl("funnel(.*)", data$EVTYPE)==T,2] <- "wind"
data <- data[grepl("summary(.*)", data$EVTYPE)==F,] 
data <- data[grepl("apache county", data$EVTYPE)==F,] 
data <- data[grepl("southeast", data$EVTYPE)==F,] 
data <- data[grepl("monthly(.*)", data$EVTYPE)==F,] 
length(unique(data$EVTYPE))
## [1] 119
# Althought we could further reduce the ```EVTYPE```, but it appears the higer values have already been classified correctly
data$EVTYPE <- toupper(data$EVTYPE)
data$FATALITIES <- as.numeric(data$FATALITIES)

Transforming Property Damage to a Dollar value and cleaning punctuations where numric is expected.

data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$PROPDMGEXP[data$PROPDMGEXP == "" | data$PROPDMGEXP == "+" | data$PROPDMGEXP == "?" | data$PROPDMGEXP == "-"] <- "0"
data$PROPDMGEXP[data$PROPDMGEXP == "H" ] <- "100"
data$PROPDMGEXP[data$PROPDMGEXP == "K" ] <- "1000"
data$PROPDMGEXP[data$PROPDMGEXP == "M" ] <- "1000000"
data$PROPDMGEXP[data$PROPDMGEXP == "B" ] <- "1000000000"
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMG_USD <- data$PROPDMG * data$PROPDMGEXP

Transforming Crop Damage to a Dollar value and cleaning punctuations where numric is expected.

data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$CROPDMGEXP[data$CROPDMGEXP == "" | data$CROPDMGEXP == "?"] <- "0"
data$CROPDMGEXP[data$CROPDMGEXP == "B" ] <- "1000000000"
data$CROPDMGEXP[data$CROPDMGEXP == "M" ] <- "1000000"
data$CROPDMGEXP[data$CROPDMGEXP == "K" ] <- "1000"
# Take out all zeros
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMG_USD <- data$CROPDMG * data$CROPDMGEXP

Data Analysis

Aggregating the data by Event Type (EVTYPE) by summarizing Fatalities, Injuries, Cost of Property Damage & Cost of Crop Damage

## Aggregating the 4 key variables for summary and plotting purposes

ImpactSummary<-aggregate(cbind(FATALITIES,INJURIES,PROPDMG_USD,CROPDMG_USD)~EVTYPE,data=data,FUN=sum,na.action=na.omit)
summary(ImpactSummary)
##     EVTYPE            FATALITIES      INJURIES      PROPDMG_USD      
##  Length:119         Min.   :   0   Min.   :    0   Min.   :0.00e+00  
##  Class :character   1st Qu.:   0   1st Qu.:    0   1st Qu.:0.00e+00  
##  Mode  :character   Median :   0   Median :    0   Median :0.00e+00  
##                     Mean   : 127   Mean   : 1181   Mean   :3.59e+09  
##                     3rd Qu.:   2   3rd Qu.:    4   3rd Qu.:5.35e+05  
##                     Max.   :5636   Max.   :91407   Max.   :1.72e+11  
##   CROPDMG_USD      
##  Min.   :0.00e+00  
##  1st Qu.:0.00e+00  
##  Median :0.00e+00  
##  Mean   :4.13e+08  
##  3rd Qu.:0.00e+00  
##  Max.   :1.93e+10
str(ImpactSummary)
## 'data.frame':    119 obs. of  5 variables:
##  $ EVTYPE     : chr  "?" "ABNORMAL WARMTH" "AVALANCE" "AVALANCHE" ...
##  $ FATALITIES : num  0 0 1 224 0 101 0 480 210 0 ...
##  $ INJURIES   : num  0 0 0 170 0 805 0 774 280 0 ...
##  $ PROPDMG_USD: num  5000 0 0 3721800 0 ...
##  $ CROPDMG_USD: num  0 0 0 0 0 ...

Although the MEAN is the more uselful for making decisions on deloying resources, we assume the question applies to SUM since it is not explicit.

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

## Population Health is represented by Fatalities and Injuries

Fatal_Data<-head(ImpactSummary[order(ImpactSummary[,2],decreasing=TRUE),],10)[,c(1,2)]
Fatal_Data
##           EVTYPE FATALITIES
## 92       TORNADO       5636
## 31          HEAT       2957
## 25         FLOOD       1784
## 51     LIGHTNING        817
## 91  THUNDERSTORM        736
## 117         WIND        673
## 8        COASTAL        480
## 79   RIP CURRENT        368
## 118      WINTERY        278
## 4      AVALANCHE        224
Injury_Data<-head(ImpactSummary[order(ImpactSummary[,3],decreasing=TRUE),],10)[,c(1,3)]
Injury_Data
##           EVTYPE INJURIES
## 92       TORNADO    91407
## 25         FLOOD    11911
## 91  THUNDERSTORM     9511
## 31          HEAT     8830
## 51     LIGHTNING     5231
## 117         WIND     1954
## 118      WINTERY     1953
## 23         FIRES     1608
## 30          HAIL     1371
## 41     HURRICANE     1326
Fatalities <- ggplot(data=Fatal_Data,aes(y=FATALITIES,x=EVTYPE)) +geom_bar(size=1,colour="black",fill="dark red",stat="identity") +theme(axis.text.x = element_text(angle = 65, hjust = 1)) +labs(list(x="Event Types",y="Number Of Fatalities"))
Injuries <- ggplot(data=Injury_Data,aes(y=INJURIES,x=EVTYPE)) +geom_bar(size=1,colour="black",fill="red",stat="identity") +theme(axis.text.x = element_text(angle = 60, hjust = 1)) +labs(list(x="Event Types",y="Number Of Injuries"))

grid.arrange(Fatalities, Injuries, main = textGrob("THE MOST HARMFUL WEATHER EVENTS TO THE POPULATION \n HEALTH IN THE U.S. FROM 1950 TO 2011", gp = gpar(fontsize = 14)), ncol = 2)

plot of chunk HealtDamage

The top 3 events that cause the most fatalities are Tornado, Heat and Flood.
The top 3 events that cause the most injuries are Tornado, Flood and Thunderstorm.

Across the United States, which types of events have the greatest economic consequences?

PropertyImpact<-head(ImpactSummary[order(ImpactSummary[,4],decreasing=TRUE),],10)[,c(1,4)]
PropertyImpact
##           EVTYPE PROPDMG_USD
## 25         FLOOD   1.721e+11
## 41     HURRICANE   8.466e+10
## 92       TORNADO   5.699e+10
## 89   STORM SURGE   4.332e+10
## 30          HAIL   1.597e+10
## 91  THUNDERSTORM   1.257e+10
## 23         FIRES   8.497e+09
## 94      TROPICAL   7.716e+09
## 118      WINTERY   6.717e+09
## 117         WIND   6.367e+09
CropImpact<-head(ImpactSummary[order(ImpactSummary[,5],decreasing=TRUE),],10)[,c(1,5)]
CropImpact
##           EVTYPE CROPDMG_USD
## 25         FLOOD   1.931e+10
## 17       DROUGHT   1.397e+10
## 41     HURRICANE   5.505e+09
## 30          HAIL   3.047e+09
## 9           COLD   1.379e+09
## 91  THUNDERSTORM   1.274e+09
## 117         WIND   9.033e+08
## 31          HEAT   8.989e+08
## 67          RAIN   7.950e+08
## 94      TROPICAL   6.949e+08
Property <-ggplot(data=PropertyImpact,aes(y=PROPDMG_USD/(10^9),x=EVTYPE))+geom_bar(stat="identity", color = "black", fill = " dark green")+theme(axis.text.x = element_text(angle = 65, hjust = 1))+labs(list(x="Event Types",y="Property Impact in Billions"))
Crops<- ggplot(data=CropImpact,aes(y=CROPDMG_USD/(10^9),x=EVTYPE))+geom_bar(stat="identity", color= "black", fill = "green")+theme(axis.text.x = element_text(angle = 65, hjust = 1))+labs(list(x="Event Types",y="Crops Impact in Billions"))
grid.arrange(Property, Crops, main = textGrob("THE MOST HARMFUL WEATHER EVENTS TO THE PROPERTY AND\n CROPS IN THE U.S. FROM 1950 TO 2011", gp = gpar(fontsize = 14)), ncol = 2)

plot of chunk EconomicDamage

The top 3 events that casue the most damage to Property are Flood, Huricane and Tornado.
The top 3 events that casue the most damage to Crops are Flood, Drought and Huricane.

WeatherSummary <- data.frame(FATALITIES = Fatal_Data$EVTYPE,INJURIES = Injury_Data$EVTYPE,PROPERTY = PropertyImpact$EVTYPE,CROPS = CropImpact$EVTYPE)

WeatherSummary
##      FATALITIES     INJURIES     PROPERTY        CROPS
## 1       TORNADO      TORNADO        FLOOD        FLOOD
## 2          HEAT        FLOOD    HURRICANE      DROUGHT
## 3         FLOOD THUNDERSTORM      TORNADO    HURRICANE
## 4     LIGHTNING         HEAT  STORM SURGE         HAIL
## 5  THUNDERSTORM    LIGHTNING         HAIL         COLD
## 6          WIND         WIND THUNDERSTORM THUNDERSTORM
## 7       COASTAL      WINTERY        FIRES         WIND
## 8   RIP CURRENT        FIRES     TROPICAL         HEAT
## 9       WINTERY         HAIL      WINTERY         RAIN
## 10    AVALANCHE    HURRICANE         WIND     TROPICAL

The above table can be used like a cheatsheet to quickly determine the top ten events on each of the segments, in their order of significanse and impact