Analysis of the consequences of weather events in the United States based on the U.S. National Oceanic and Atmospheric Administration Storm Database

Author: Murtuza Ali Lakhani Date: September 26, 2015

Reproducible Research: Peer Assessment 2

1. Assignment

The fundamental goal of this assignment is to explore the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database and answer the following questions about the consequences of severe weather events:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

We are to use the database to answer certain questions and show the code for our entire analysis. Our analysis is to consist of tables, figures, or other summaries. We have been given the freedom to use any R package needed to support our analysis.

2. Abstract

A public database maintained by the NOAA contains quantitative data on the type of storm events, details like location, date, estimates for damage to property, and the number of human victims of the storm. The purpose of this study is to explore the type of events that are most harmful to the economy and human life.

In terms of the event most harmful with respect to population health, data show that Heat and Drought cause the highest number of fatalities. Tornadoes and Flooding & High Surf are the next leading causes of fatalities. Tornadoes cause the highest number of injuries, followed by Thunderstorm & Lightning and Flooding & High Surf. In contrast, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of human fatality and injuries.

In terms of economic damage, data show Flooding & High Surf and Wind & Storm rank as the top two categories, in that order. Further, data show that Heat & Drought causes the most damage to crops. In contrast, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of economic damage caused.

3. Data Processing

Part of the work below has been adapted from Tom Lous’s project.

3.1. Load the requisite libraries if they don’t already exist

These are libraries needed to load, compute, transform, and plot data

if (!("RCurl" %in% rownames(installed.packages())) ) {
  install.packages("RCurl")
} else {
  library(RCurl)
}

if (!("R.utils" %in% rownames(installed.packages())) ) {
  install.packages("R.utils")
} else {
  library(R.utils)
}

if (!("plyr" %in% rownames(installed.packages())) ) {
  install.packages("plyr")
} else {
  library(plyr)
}

if (!("reshape2" %in% rownames(installed.packages())) ) {
  install.packages("reshape2")
} else {
  library(reshape2)
}

if (!("ggplot2" %in% rownames(installed.packages())) ) {
  install.packages("ggplot2")
} else {
  library(ggplot2)
}

if (!("grid" %in% rownames(installed.packages())) ) {
  install.packages("grid")
} else {
  library(grid)
}

if (!("gridExtra" %in% rownames(installed.packages())) ) {
  install.packages("gridExtra")
} else {
  library(gridExtra)
}

if (!("scales" %in% rownames(installed.packages())) ) {
  install.packages("scales")
} else {
  library(scales)
}
3.2. Load the data from source and extract it

RData (binary format) file is used for faster processing. RData file is loaded only once, but simply invoked for all subsequent uses.

dataExists <- TRUE
# Check if the directory path leading up to the Rdata file exists.  Load data if it does.
if(file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.RData")){
  load("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.RData")
  dataExists <- FALSE
}

if(dataExists){
  # Create a directory to hold data if it doesn't already exist
  if(!file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data")){
      dir.create("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data")
  }
  # Load file from URL to bz2 file in data directory.  (Flags adopted from Tom Lous' work.)
  if(!file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv.bz2")){
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    destPath <- "C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv.bz2"
    binData <- getBinaryURL(fileUrl, ssl.verifypeer=0L, followlocation=1L)
    destFileHandle <- file(destPath, open="wb")
    writeBin(binData,destFileHandle)
    close(destFileHandle)
  }
  # Unzip bz2 file to csv
  if(!file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv")){
    filePath <- "C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv.bz2"
    destPath <- "C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv"
    bunzip2(filePath,destPath,overwrite=TRUE, remove=FALSE)
  }
}
3.3. Read the data

Read the source .csv file

if(dataExists){
  csv_StormData <- read.csv("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv")
}
3.4. Remove the columns not needed for this analysis

Keep only the columns needed for this analysis, namely BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

if(dataExists){
  needed_Columns <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
  truncated_StormData <- csv_StormData[, needed_Columns]
  truncated_StormData <- rename(truncated_StormData, c("FATALITIES"="fatalities", "INJURIES"="injuries"))
}
3.5. Drop 15% of data, mostly from earlier years. Reformat BGN_DATE.

Since most of the data is from later years, drop 85% of data (cutoff tentatively set at 85%) from the earlier period.

if(dataExists){
  # get some counts
  total_number_of_observations <- nrow(truncated_StormData)
  cutoff_percentage = 0.85
  cutoff_observations = round(total_number_of_observations * cutoff_percentage)
  
  # add columns for date calculations based on BGN_DATEro
  truncated_StormData$year = as.numeric(format(as.Date(truncated_StormData$BGN_DATE, format = "%m/%d/%Y"), "%Y"))
  
  # create dataset with count per year, reverse the recordset, create running total 
  year_Records <- count(truncated_StormData, "year")
  year_Records <- year_Records[order(year_Records$year, decreasing=TRUE), ]
  year_Records$runningTotal = cumsum(year_Records$freq)
  cutoff_Year <- min(year_Records[year_Records$runningTotal < cutoff_observations, 1])
  
  # reduce the dataset
  truncated_StormData <- truncated_StormData[truncated_StormData$year >= cutoff_Year, ]
  endYear <- max(truncated_StormData$year)
  
  # clean truncated_StormData
  truncated_StormData$BGN_DATE <- NULL
  rownames(truncated_StormData) <- NULL
}
3.6. Compress/re-factor EVTYPE to just 11 levels.

The EVTYPE variable contains hundreds of unique source events, many of which can be clustered together. We will define 11 levels, covering effectively the majority of the data records. Summaries and combinations will be skipped.

if(dataExists){
  truncated_StormData$damageSource <- NA
  
  truncated_StormData[grepl("precipitation|rain|hail|drizzle|wet|percip|burst|depression|fog|wall cloud", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Precipitation & Fog"
  truncated_StormData[grepl("wind|storm|wnd|hurricane|typhoon", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Wind & Storm"
  truncated_StormData[grepl("slide|erosion|slump", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Landslide & Erosion"
  truncated_StormData[grepl("warmth|warm|heat|dry|hot|drought|thermia|temperature record|record temperature|record high", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Heat & Drought"
  truncated_StormData[grepl("cold|cool|ice|icy|frost|freeze|snow|winter|wintry|wintery|blizzard|chill|freezing|avalanche|glaze|sleet", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Snow & Ice"
  truncated_StormData[grepl("flood|surf|blow-out|swells|fld|dam break", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Flooding & High Surf"
  truncated_StormData[grepl("seas|high water|tide|tsunami|wave|current|marine|drowning", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "High Seas"
  truncated_StormData[grepl("dust|saharan", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Dust & Saharan Winds"  
  truncated_StormData[grepl("tstm|thunderstorm|lightning", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Thunderstorm & Lightning"
  truncated_StormData[grepl("tornado|spout|funnel|whirlwind", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Tornado"
  truncated_StormData[grepl("fire|smoke|volcanic", 
                         truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Fire & Volcanic Activity"
  
  # Remove uncategorized (damageSource = NA) records.  Cast damageSource as a factor.
  truncated_StormData <- truncated_StormData[complete.cases(truncated_StormData[, "damageSource"]), ]
  truncated_StormData$damageSource <- as.factor(truncated_StormData$damageSource)
  
  # Clean EVTYPE variable in truncated_StormData
  truncated_StormData$EVTYPE <- NULL
}
3.7. Refactor PROPDMG, CROPDMG, PROPDMGEXP & CROPDMGEXP to absolute values

Assign absolute values to DMG and DMGEXP fields. Impute NA for all undefined EXP properties, such as +, ?.

if(dataExists){
  # Function to convert symbol to a power of 10 (hundreds, thousands, millions, and billions).  (Adapted from Tom Lous' work.)
  to_the_power_of_ten <- function(x){
    if(is.numeric(x)) {
      x <- x
    }
    else if(grepl("h", x, ignore.case=TRUE)) {
      x <- 2
    }
    else if(grepl("k", x, ignore.case=TRUE)) {
      x <- 3
    }
    else if(grepl("m", x, ignore.case=TRUE)) {
      x <- 6
    }
    else if(grepl("b", x, ignore.case=TRUE)) {
      x <- 9
    }
    else if(x == "" || x == " "){
      x <- 0
    }
    else{
      x <- NA
    }
    x
  }
   
  # Function to take two parameters num and exp and convert them to one absolute value. Assign zero to non-integers.
  calculate_damage_amount <- function(num, exp){
    raise_to_this_power <- to_the_power_of_ten(exp)
    if(is.numeric(num)){
      num <- num * (10 ^ raise_to_this_power)
    }
    
    if(!is.numeric(num)){
      num <- 0
    }
    
    num
  }
  
  # Create two new fields for calculated propDamage & cropDamage and add them to the damageTotal field.
  truncated_StormData$propDamage <- mapply(calculate_damage_amount, truncated_StormData$PROPDMG, truncated_StormData$PROPDMGEXP)
  truncated_StormData$cropDamage <- mapply(calculate_damage_amount, truncated_StormData$CROPDMG, truncated_StormData$CROPDMGEXP)
  truncated_StormData$damageTotal = truncated_StormData$propDamage + truncated_StormData$cropDamage
  
  # Clean out property/crop and damage/expense fields in truncated_StormData
  truncated_StormData$PROPDMG <- NULL
  truncated_StormData$PROPDMGEXP <- NULL
  truncated_StormData$CROPDMG <- NULL
  truncated_StormData$CROPDMGEXP <- NULL
}
3.8. Create aggregated datasets and variables for plots

Recast final data frames to be used in certain plot funtions.

if(dataExists){
  # Aggregate economic damage per damageSource
  Sum_of_Economic_Damage <- aggregate(formula=cbind(propDamage, cropDamage, damageTotal) ~ damageSource, data=truncated_StormData, FUN=sum, na.rm=TRUE)
  Sum_of_Economic_Damage <- Sum_of_Economic_Damage[order(Sum_of_Economic_Damage$damageTotal, decreasing=TRUE),]
  rownames(Sum_of_Economic_Damage) <- NULL
  Sum_of_Economic_Damage$damageSource <- factor(Sum_of_Economic_Damage$damageSource, levels=rev(Sum_of_Economic_Damage$damageSource))
  
  # Melt the Sum_of_Economic_Damage into data frame to be used as bar chart
  meltSum_of_Economic_Damage <- melt(Sum_of_Economic_Damage, id.vars=c("damageSource"), measure.vars=c("propDamage","cropDamage"), variable.name="damageType", value.name="damage")
  levels(meltSum_of_Economic_Damage$damageType)[levels(meltSum_of_Economic_Damage$damageType)=="propDamage"] <- "property"
   levels(meltSum_of_Economic_Damage$damageType)[levels(meltSum_of_Economic_Damage$damageType)=="cropDamage"] <- "crops"
  
  # Aggregate humanitarian damage per damageSource
  Sum_of_Human_Damage <-aggregate(formula=cbind(injuries, fatalities) ~ damageSource, data=truncated_StormData, FUN=sum, na.rm=TRUE) 
  Sum_of_Human_Damage <- Sum_of_Human_Damage[order(Sum_of_Human_Damage$injuries, decreasing=TRUE),]
  Sum_of_Human_Damage_by_Fatalities <- Sum_of_Human_Damage[order(Sum_of_Human_Damage$fatalities, decreasing=TRUE),]
  rownames(Sum_of_Human_Damage) <- NULL
  rownames(Sum_of_Human_Damage_by_Fatalities) <- NULL
  Sum_of_Human_Damage$damageSource <- factor(Sum_of_Human_Damage$damageSource, levels=rev(Sum_of_Human_Damage$damageSource))
  
  # Define maximum values for bar chart axis
  maxInjuries <- max(Sum_of_Human_Damage$injuries)
  maxInjuries <- maxInjuries + round(maxInjuries * 0.25)
 
  maxFatalities <- max(Sum_of_Human_Damage$fatalities)
  maxFatalities <- maxFatalities + round(maxFatalities * 0.25)  
}
3.10. Save truncated_StormData to the RData file

Save the processed data to RData file.

if(dataExists){
  save(truncated_StormData, 
       Sum_of_Human_Damage, 
       Sum_of_Human_Damage_by_Fatalities,
       meltSum_of_Economic_Damage,
       Sum_of_Economic_Damage, 
       maxInjuries, 
       maxFatalities,
      cutoff_Year,
       endYear,
       file="C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.RData")
}

4. Results

4.1. Show the first & last five lines of the new, processed data set

Display a few records of the cleaned, reformatted stormData to be used for analysis.

head(truncated_StormData, n=5L)
##   fatalities injuries year             damageSource propDamage cropDamage
## 1          0        0 1989                  Tornado    2500000          0
## 2          0        0 1989                  Tornado     250000          0
## 3          0        0 1989 Thunderstorm & Lightning          0          0
## 4          0        0 1989                  Tornado    2500000          0
## 5          0        0 1989 Thunderstorm & Lightning          0          0
##   damageTotal
## 1     2500000
## 2      250000
## 3           0
## 4     2500000
## 5           0
tail(truncated_StormData, n=5L)
##        fatalities injuries year damageSource propDamage cropDamage
## 762146          0        0 2011 Wind & Storm          0          0
## 762147          0        0 2011 Wind & Storm          0          0
## 762148          0        0 2011 Wind & Storm          0          0
## 762149          0        0 2011   Snow & Ice          0          0
## 762150          0        0 2011   Snow & Ice          0          0
##        damageTotal
## 762146           0
## 762147           0
## 762148           0
## 762149           0
## 762150           0
4.2. Injuries and Fatalities

Show the injuries and fatalaties for each major weather event. Tornadoes cause the highest number of injuries, followed by Thunderstorm & Lightning and Flooding & High Surf. In contrast, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of causes of human injuries. Heat and Drought cause the highest number of fatalities. Tornadoes and Flooding & High Surf are the next leading causes of fatalities. In contrast, Dust & Saharan Winds and Landslide & Erosion rank the lowest on the list of causes of human fatalities.

# Plot injuries  
ggplot(data=Sum_of_Human_Damage, aes(x=damageSource, y=injuries)) +
            geom_bar(stat = "identity", fill="lightblue", colour = 'red') + 
            geom_text(aes(label=injuries), size=3, vjust=0.5, hjust=2.0) +
            ggtitle("Injuries") +
            scale_y_continuous(expand=c(0, 0), limits=c(0, maxInjuries)) + 
            coord_flip()

# Plot fatalities.  
ggplot(data=Sum_of_Human_Damage[order(Sum_of_Human_Damage$fatalities, decreasing=TRUE),], aes(x=damageSource, y=fatalities)) +
            geom_bar(stat = "identity",  fill="magenta", colour='blue') + 
            geom_text(aes(label=fatalities), size=3, vjust=0.5, hjust=-1.0) +
            ggtitle("Fatalities") +
            scale_y_continuous(expand=c(0, 0), limits=c(0,maxFatalities)) + 
            coord_flip()

The underlying data

Sum_of_Human_Damage
##                damageSource injuries fatalities
## 1                   Tornado    28037       1834
## 2  Thunderstorm & Lightning    13222       1398
## 3      Flooding & High Surf     8931       1715
## 4            Heat & Drought     8865       2969
## 5                Snow & Ice     6874       1344
## 6              Wind & Storm     3601        644
## 7       Precipitation & Fog     2528        195
## 8  Fire & Volcanic Activity     1608         90
## 9                 High Seas     1139        898
## 10     Dust & Saharan Winds      483         24
## 11      Landslide & Erosion       55         44
4.3. Economic Damage

In terms of economic damage, data show Flooding & High Surf and Wind & Storm rank as the top two categories, in that order. Further, data show that Heat & Drought causes the most damage to crops. In relative terms, the damage to property far outweighs that to crops. Contrastingly, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of the causes of economic damage.

ggplot(meltSum_of_Economic_Damage, aes(x=damageSource, y=damage/1000000)) + 
  geom_bar(stat = "identity", aes(fill=damageType)) +
  xlab("Event Type") +
  theme(axis.text.x = element_text(angle = 45, size=8, hjust = 1, vjust = 1)) +
  ylab("Total Economic Damage (millions of USD)") +
  ggtitle(paste("Aggregated property and crop damage involving weather events from ",cutoff_Year," to ",endYear, sep=""))

The underlying data

Sum_of_Economic_Damage
##                damageSource   propDamage  cropDamage  damageTotal
## 1      Flooding & High Surf 167702697194 12389477200 180092174394
## 2              Wind & Storm 142713420213  6961399350 149674819563
## 3                   Tornado  33884111913   367458360  34251570273
## 4       Precipitation & Fog  18970846467  3958490253  22929336720
## 5                Snow & Ice  12705867660  8721961900  21427829560
## 6            Heat & Drought   1062504300 14871450280  15933954580
## 7  Thunderstorm & Lightning  11909700240  1283798498  13193498738
## 8  Fire & Volcanic Activity   8502228500   403281630   8905510130
## 9                 High Seas   4809787890    46622500   4856410390
## 10      Landslide & Erosion    328262100    20017000    348279100
## 11     Dust & Saharan Winds      6337630     3600000      9937630