Impact of weather on human life and economy

Synopsis

This report analyzes data on major storms and weather events in the United States obtained from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The goal of this report is to analyze and visually show the worst weather events impacting:

Health and well-being of human population
Economic damages to property and crop

This analysis shows that the top 5 severe weather events affecting population are:
* Tornado
* Excessive heat
* Flash floods
* Lightning
* Thunderstorm wind

The top 5 severe weather events affecting property and crop economically are:
* Flood
* Hurricane (Typhoon)
* Tornado
* Storm Surge
* Hail

The accuracy of this report depends on the extent of correctness in the raw data recorded by various agencies across the United States and assembled in the NOAA’s database.

Data Processing

Software requirements and environment setup

This analysis was done using
* R version 3.1.2
* RStudio IDE version 0.98.1102

The following external R packages were used:
* R.utils * dplyr
* tidyr
* ggplot2
* knitr

The following R packages are used in the data processing and analysis in this report. These pacakges should be installed as a prerequisite for running this report. Once they are installed, the libraries are loaded in the environment.

library(R.utils)
library(dplyr)
library(tidyr)
library(ggplot2)

A couple of custom functions were written to be used later in the analysis. These are defined as follows.

## vsub()     This function takes in a vector, finds a matching pattern and 
##            replaces it with another character vector passed. It returns back
##            the new vector. This will throw a warning if x is a factor and the 
##            replace vector is not an existing level.
## x          The character vector to be changed
## pattern    A regular expression to find matching strings to replace
## replace    The character vector to use as a replacement for matching patterns
vsub <- function(x, pattern, replace) {
  x[which(grepl(pattern, x))] <- replace
  return(x)
}

## calcapply()  This function returns the numeric equivalent of a "H", "K", "M"
##              and "B" passed. This is used to get appropriate multipliers for
##              hundreds, thousands, millions, and billions.
## x            A character representing a multiplier
calcapply <- function(x) {
  x <- toupper(x)
  
  if(x == "H")
    return(100)
  else if(x == "K")
    return(1000)
  else if(x == "M")
    return(1000000)
  else if(x == "B")
    return(1000000000)
  else
    return(0)

}

Downloading and reading the data

The data is available in the form of a bzip2 compressed CSV file and can be downloaded from Storm Data

As a first step, the downloaded file is uncompressed to extract the CSV file which is then read into a data frame for further processing.

if(file.exists("repdata-data-StormData.csv.bz2"))
  bunzip2("repdata-data-StormData.csv.bz2", overwrite = TRUE)

w <- read.csv("repdata-data-StormData.csv", header=TRUE, stringsAsFactors=FALSE)

The raw data consists of 902297 observations and 37 variables. The documentation on the National Weather Service website was referenced to understand the data. The source for this documentation is available at:
* National Weather Service (NWS) Storm Data Documentation
* National Climatic Data Center Storm Events FAQ

To determine the two results stated above in the synopsis, only a subset of the raw data is required. The variables required for this analysis are:
* EVTYPE
* FATALITIES
* INJURIES
* PROPDMG
* PROPDMGEXP
* CROPDMG
* CROPDMGEXP

Fortunately, the data in these variables is complete and there are no NAs or NANs in this subset. So we subset the data for these variables and convert the event types to upper case to standardize the user inputs.

w1 <- w %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
w1$EVTYPE <- toupper(w1$EVTYPE)

Data Cleaning

The most important variable in this dataset used to answer our questions is EVTYPE, and it is poorly constructed. Due to the data being sourced from various external agencies, the format of the event type names varies greatly. The code book describes allowed values for each variable and using the valid definitions from the NWS code book, we attempt to clean up most of the significant events that will be used in the analysis and results phase.

## clean up evtype based on valid Event Types from the code book
w1 <- w1 %>% filter(EVTYPE != "?")        ## one observation without a event type
w1$EVTYPE <- gsub("HVY", "HEAVY", w1$EVTYPE)
w1$EVTYPE <- gsub("TSTM|THUDERSTORM|TUNDERSTORM|THUNDERTORM|THUNDEERSTORM|THUNDERESTORM", "THUNDERSTORM", w1$EVTYPE)

w1$EVTYPE <- vsub(w1$EVTYPE, "^TORNADO.*", "TORNADO")
w1$EVTYPE <- vsub(w1$EVTYPE, "^STRONG WIND?", "STRONG WIND")
w1$EVTYPE <- vsub(w1$EVTYPE, ".*BLIZZARD.*", "BLIZZARD")
w1$EVTYPE <- vsub(w1$EVTYPE, "^HEAVY SNOW.*", "HEAVY SNOW")
w1$EVTYPE <- vsub(w1$EVTYPE, "^HIGH WIND.*|^WINDS.*|^WIND|^SNOW.*WIND|NON-THUNDERSTORM WIND", "HIGH WIND")
w1$EVTYPE <- vsub(w1$EVTYPE, ".*WIND.*CHILL|^EXTREME COLD.*", "EXTREME COLD/WIND CHILL")
w1$EVTYPE <- vsub(w1$EVTYPE, ".*THUNDERSTORM.*WIND.*|^GUSTY.*|.*SEVERE THUNDERSTORM.*", "THUNDERSTORM WIND")
w1$EVTYPE <- vsub(w1$EVTYPE, "^WILD.*FIRE", "WILDFIRE")
w1$EVTYPE <- vsub(w1$EVTYPE, "^HEAT|^UNSEASONABLY WARM.*", "HEAT")
w1$EVTYPE <- vsub(w1$EVTYPE, "^.*EXCESSIVE HEAT|^EXTREME HEAT|^RECORD HEAT", "EXCESSIVE HEAT")
w1$EVTYPE <- vsub(w1$EVTYPE, "^RIP CURRENTS", "RIP CURRENT")
w1$EVTYPE <- vsub(w1$EVTYPE, "^HURRICANE", "HURRICANE(TYPHOON)")
w1$EVTYPE <- vsub(w1$EVTYPE, "^TROPICAL STORM.*", "TROPICAL STORM")
w1$EVTYPE <- vsub(w1$EVTYPE, "^WINTER STORM.*", "WINTER STORM")
w1$EVTYPE <- vsub(w1$EVTYPE, "^WINTER WEATHER.*|^WINTRY.*", "WINTER WEATHER")
w1$EVTYPE <- vsub(w1$EVTYPE, "^FREEZING.*", "FREEZING FOG")
w1$EVTYPE <- vsub(w1$EVTYPE, "^FOG", "DENSE FOG")
w1$EVTYPE <- vsub(w1$EVTYPE, "^.*SURF", "HIGH SURF")
w1$EVTYPE <- vsub(w1$EVTYPE, "^.*FLASH.*FLOOD.*", "FLASH FLOOD")
w1$EVTYPE <- vsub(w1$EVTYPE, "^FLOOD.+|^RIVER FLOOD", "FLOOD")
w1$EVTYPE <- vsub(w1$EVTYPE, "^STORM SURGE.*", "STORM SURGE/TIDE")
w1$EVTYPE <- vsub(w1$EVTYPE, "^.*ICE.*", "ICE STORM")
w1$EVTYPE <- vsub(w1$EVTYPE, "^SNOW.*|[BLOWING|RECORD|EXCESSIVE] SNOW", "HEAVY SNOW")

w1$EVTYPE <- vsub(w1$EVTYPE, ".*FREEZE.*", "FROST/FREEZE")
w1$EVTYPE <- vsub(w1$EVTYPE, ".*WATERSPOUT.*", "WATERSPOUT")
w1$EVTYPE <- vsub(w1$EVTYPE, "^HAIL.*|HAIL.+", "HAIL")
w1$EVTYPE <- vsub(w1$EVTYPE, ".*HEAVY.*RAIN.*", "HEAVY RAIN")


w1$EVTYPE <- as.factor(w1$EVTYPE)

Analysis

Impact on population health

There is huge variation in the data of our variables and therefore to focus on the most significant events that answer our driving questions, all events that did not result in any fatality or injury are dropped from the analysis result set. Furthermore, after grouping and summing the total fatalities and injuries by each reported event type, only the ten highest observations are considered for final comparison and graphing.
Finally, for both fatalities and injuries the percentages of total is calculated and merged in a single data frame and the variables collapsed into a single percentage number qualified by the type, whether fatality or injury.

## Subset the data to process events that were most harmful to population health
## in terms of fatalities and injuries caused. Therefore ignore events that had
## no fatalities or injuries
w2 <- w1 %>%  filter(FATALITIES != 0 | INJURIES != 0)

## Summarize the totals of fatalities and injuries for each level of Event Type
w2 <- w2 %>% group_by(EVTYPE) %>% summarize(TOTAL_FATALITIES = sum(FATALITIES), 
                                            TOTAL_INJURIES = sum(INJURIES))

## From w2, subset the data to choose events that had more than 10 fatalities as
## we would like to examine the most harmful events 
fatal10 <- w2 %>% filter(TOTAL_FATALITIES > 10) 

## Convert the actual number of fatalities to percentages and store in a new data frame
perc_fatal10 <- fatal10 %>% 
                mutate(FATALITIES = TOTAL_FATALITIES/sum(TOTAL_FATALITIES) * 100) %>%
                select(EVTYPE, FATALITIES)


## Do the same with injuries and get the events with more than 10 injuries
injure10 <- w2 %>% filter(TOTAL_INJURIES > 10) 

## Convert the actual number of injuries to percentages and store in a new data frame
perc_injure10 <- injure10 %>% 
                mutate(INJURIES = TOTAL_INJURIES/sum(TOTAL_INJURIES) * 100) %>%
                select(EVTYPE, INJURIES)

## Now merge the two percentage data frames based on common events so we can plot
## a stacked bar chart of both percentage of fatalities and injuries
harmful <- merge(perc_fatal10, perc_injure10, by.x="EVTYPE", by.y="EVTYPE", all=TRUE)
harmful <- harmful %>% gather(TypeOfHarm, Percentage, FATALITIES, INJURIES, na.rm=TRUE)

Economic consequences

For this part of the analysis, the previously cleaned dataset is filtered to select observations that resulted in some non-zero property or crop damages. Each type of damage is recorded with a qualifier stating if the amount is in hundreds, thousands, millions or billions. This qualifier is a single alphanumeric. The custom function calcapply() is applied to the PROPDMEXP and CROPDMGEXP to convert them into numeric variables containing the multiplier. The actual damages are then calculated by multiplying this to the PROPDMG and CROPDMG variables respectively.
Once the final damages are known, we group the data by each event type and calculate the total of both types of damages and convert the result to billions of dollars by dividing the sum with 1 billion. Then, the results are ordered from highest to lowest and the top twenty observations are selected for graphing the results.

## Use the cleaned data and subset only observations that have non-zero damages
eco <- w1 %>% filter(PROPDMG != 0 | CROPDMG != 0)

## convert the alphabetic cost description to an equivalent multiplier for both
## property and crop damages
eco$PROPDMGEXP <- sapply(eco$PROPDMGEXP, calcapply)
eco$CROPDMGEXP <- sapply(eco$CROPDMGEXP, calcapply)

## With the multiplier obtained above, calculate the actual damage amount of each
## type and add it as new variables 
eco <- eco %>% mutate(ACTPROPDMG = PROPDMG * PROPDMGEXP,
                    ACTCROPDMG = CROPDMG * CROPDMGEXP)

## For each event type, find the total damages by adding the individual values
## and converting to billions of dollars
eco <- eco %>%  group_by(EVTYPE) %>% 
              summarize(TOTAL_DMG = (sum(ACTPROPDMG, na.rm=T) + sum(ACTCROPDMG, na.rm=T))/1e+9)

## Arrange from highest to lowest and select the top 20 for graphing
eco <- eco %>% arrange(desc(TOTAL_DMG))
eco <- eco[1:20, ]

## Calculate the percentages of the total
eco <- eco %>% mutate(PERC_DMG = (TOTAL_DMG/sum(TOTAL_DMG, na.rm=T)) * 100)

Results

Impact of weather on population health

To understand how different weather events affect the population health in terms of fatalities and injuries, we consider the ten most harmful events and breakdown the percentage of deaths caused by each. The pie chart below shows this distribution. As is obvious from this chart, tornadoes are the most harmful of all weather events, resulting in a huge 37.7% of all fatalities dues to weather.

## Pie chart showing the top 10 events across the US 
h1 <- harmful %>% filter(TypeOfHarm == "FATALITIES") %>% arrange(desc(Percentage))
slices <- h1[1:10, "Percentage"]
lbls <- paste(h1[1:10, "EVTYPE"], "-", format(h1[1:10, "Percentage"], digits=2), "%", sep=" ")
pie(slices, labels=lbls, col=rainbow(length(lbls)), main="Pie chart of 10 Most Harmful Events")

To show the fatalities and injuries separately caused by weather events other than tornadoes, so we can scale the data better, the bar chart below displays how the next worst events stack up for number of fatalities and injuries.

## Filter out the Tornado events to better plot the comparison of other events
h2 <- harmful %>% filter(EVTYPE != "TORNADO" & Percentage > 1)

## Using a facet grid plot two separate graphs for the two types of harm
p <- ggplot(h2, aes(x = reorder(EVTYPE, Percentage), y = Percentage))
p + geom_bar(stat = "identity", fill = "lightblue", colour = "black")  +
  facet_grid(. ~ TypeOfHarm, scales="free") +
  labs(title = "Harmful Impacts of Severe Weather Events",
         x = "Types of Weather Events",
         y = "Estimated Percentages") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle=60, hjust=1),
        plot.title = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=14),
        axis.title.y = element_text(face="bold", size=14),
        axis.line = element_line(colour="black"),
        strip.text = element_text(face="bold", size=12),
        strip.background = element_rect(fill="lightpink", colour="black", size=1))

Economic consequences of weather events

The graph below shows that the most damages to property and crops is caused by floods.

r <- ggplot(eco, aes(x = reorder(EVTYPE, TOTAL_DMG), y = TOTAL_DMG, fill = PERC_DMG))
r + geom_bar(stat = "identity") + coord_flip() +
    labs(title = "Economic Impact of Severe Weather Events",
         fill = "Percentage\nof Total\nDamages\n",
         x = "Types of Weather Events",
         y = "Estimated Property and Crop Damages (in Billion Dollars)") +
    theme_bw() + 
    theme(panel.grid.major.y = element_line(colour="grey60", linetype="dashed"),
          plot.title = element_text(face="bold", size=16),
          axis.title.x = element_text(face="bold", size=14),
          axis.title.y = element_text(face="bold", size=14))