### About

This is the second peer assessment project for the Reproducible Research course in the Data Science specialization. The purpose of the project was to determine which storm events had the most significant economic and health effects in the US. Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) that documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.

# Synopsis

Severe weather has serious economic and health impacts, causing property damage, crop damage, injury and even death. The purpose of this assignment was to determine which weather types had the greatest economic and health effects. Economic effects were operationalized as the degree of property and crop damage. Health effects were operationalized as number of fatalities and injuries.

Data was taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and analyzed using R, RStudio and the knitr package. The raw data was preprocessed prior to the analysis.

The report begins with data processing, followed by the analysis, with the most important results plotted. Property damage is plotted on a logarithmic scale because of its large range of values. The report ends with the results and a brief discussion of both the health and the economic impact of weather events.

## Data Processing

Event data prior to 1996 is incomplete; it only contains the Tornado, Thunderstorm Wind and Hail event types, while data from 1996 onward contains all 48 event types that are in current use. The analysis was performed on Storm Data downloaded from the National Climatic Data Center (url: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Some documentation of the data is also available at url: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf.

The first step was to read the data into a data frame.

# read.csv() decompresses the .bz2 file on the fly
storm <- read.csv("/home/cchisanga/DataScienceSpCourseCoursera/5.Reproducible_Research/PeerAssignment2/repdata-data-StormData.csv.bz2", strip.white = TRUE)
# head(storm)
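
The read step above relies on a local copy of the file. As a minimal sketch of making that step reproducible on another machine, the download could be wrapped in a small helper. The names `fetch_storm_data`, `storm_url` and `dest` are hypothetical and not part of the original analysis; the URL is the one given in the text.

```r
# Hypothetical helper, not part of the original report
storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

fetch_storm_data <- function(dest, url = storm_url) {
    # Download only when no local copy exists, so re-knitting stays fast
    if (!file.exists(dest)) {
        download.file(url, destfile = dest, mode = "wb")
    }
    dest
}

# Usage (not run here because it needs network access):
# storm <- read.csv(fetch_storm_data("StormData.csv.bz2"), strip.white = TRUE)
```

Returning the path lets the helper be passed straight to `read.csv`, which can read bzip2-compressed files directly.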

Before the analysis, the data needs some preprocessing. Event types don’t have a consistent format. For instance, there are events with types Frost/Freeze, FROST/FREEZE and FROST\FREEZE, which obviously refer to the same type of event.

# number of unique event types
length(unique(storm$EVTYPE))
## [1] 985
# translate all letters to lowercase
event_types <- tolower(storm$EVTYPE)
# replace each blank or punctuation character with a space
# ('+' is already part of [:punct:], so it needs no separate entry)
event_types <- gsub("[[:blank:][:punct:]]", " ", event_types)
length(unique(event_types))
## [1] 874
# update the data frame
storm$EVTYPE <- event_types

After the cleaning, as expected, the number of unique event types drops noticeably, from 985 to 874. The cleaned event types are used for the rest of the analysis. No further preprocessing was performed, although the event type field could be processed further to merge event types such as tstm wind and thunderstorm wind.
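
As a rough sketch of what such a merge could look like, the helper below expands the `tstm` abbreviation and folds one plural form. The function name `normalize_event` and the two synonym rules are illustrative assumptions, not an exhaustive mapping. Unlike the cleaning step in this report, it also collapses runs of separators and trims whitespace, so it would not reproduce the counts shown above.

```r
# Hypothetical helper; the synonym rules are illustrative, not exhaustive
normalize_event <- function(x) {
    x <- tolower(x)
    # collapse each run of blanks/punctuation into a single space
    x <- gsub("[[:blank:][:punct:]]+", " ", x)
    x <- trimws(x)
    # expand the abbreviation used in older records
    x <- gsub("\\btstm\\b", "thunderstorm", x, perl = TRUE)
    # fold the plural form into the singular one
    x <- gsub("^thunderstorm winds$", "thunderstorm wind", x)
    x
}

normalize_event(c("TSTM WIND", "Thunderstorm Winds", "FROST/FREEZE"))
# "thunderstorm wind" "thunderstorm wind" "frost freeze"
```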

## Dangerous Events with respect to Population Health

To find the event types that are most harmful to population health, the numbers of fatalities and injuries are summed for each event type.

library(plyr)
casualties <- ddply(storm, .(EVTYPE), summarize,
                    fatalities = sum(FATALITIES),
                    injuries = sum(INJURIES))

# Find events that caused most death and injury
fatal_events <- head(casualties[order(casualties$fatalities, decreasing = TRUE), ], 10)
injury_events <- head(casualties[order(casualties$injuries, decreasing = TRUE), ], 10)
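
To make the group-and-sum step concrete, here is a small sketch of the same idea on toy data, using base R's `aggregate` instead of plyr. The data frame `toy` and its values are invented for illustration and are not taken from the NOAA file.

```r
# Toy data invented for illustration; not taken from the NOAA file
toy <- data.frame(
    EVTYPE     = c("tornado", "flood", "tornado", "heat"),
    FATALITIES = c(3, 1, 2, 4),
    INJURIES   = c(10, 0, 5, 2)
)

# Sum both casualty columns within each event type
toy_casualties <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                            data = toy, FUN = sum)

# Rank by fatalities, largest first: tornado (5), heat (4), flood (1)
toy_ranked <- toy_casualties[order(toy_casualties$FATALITIES, decreasing = TRUE), ]
toy_ranked
```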

The 10 event types that caused the largest number of deaths are:

fatal_events[, c("EVTYPE", "fatalities")]
##             EVTYPE fatalities
## 737        tornado       5633
## 109 excessive heat       1903
## 132    flash flood        978
## 234           heat        937
## 400      lightning        816
## 760      tstm wind        504
## 148          flood        470
## 511    rip current        368
## 309      high wind        248
## 11       avalanche        224

The 10 event types that caused the largest number of injuries are:

injury_events[, c("EVTYPE", "injuries")]
##                EVTYPE injuries
## 737           tornado    91346
## 760         tstm wind     6957
## 148             flood     6789
## 109    excessive heat     6525
## 400         lightning     5230
## 234              heat     2100
## 377         ice storm     1975
## 132       flash flood     1777
## 670 thunderstorm wind     1488
## 203              hail     1361

## Economic Effects of Weather Events

To analyze the impact of weather events on the economy, the available property damage and crop damage reports and estimates were used.

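In the raw data, property damage is represented with two fields: a numeric value PROPDMG and a magnitude code PROPDMGEXP (for example, K for thousands, M for millions and B for billions). The damage in dollars is the value multiplied by ten raised to the exponent that the code stands for. Crop damage is represented the same way, using CROPDMG and CROPDMGEXP. The first step in the analysis is to calculate the property and crop damage for each event.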

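exp_transform <- function(e) {
    # Coerce to character first: on a factor (the read.csv default before
    # R 4.0), as.numeric() returns the level code, not the digit itself.
    e <- as.character(e)
    # h -> hundred, k -> thousand, m -> million, b -> billion
    if (e %in% c('h', 'H'))
        return(2)
    else if (e %in% c('k', 'K'))
        return(3)
    else if (e %in% c('m', 'M'))
        return(6)
    else if (e %in% c('b', 'B'))
        return(9)
    else if (grepl("^[0-9]+$", e)) # a digit is taken as the power of ten
        return(as.numeric(e))
    else if (e %in% c('', '-', '?', '+')) # missing or ambiguous: no scaling
        return(0)
    else {
        stop("Invalid exponent value.")
    }
}
# USE.NAMES = FALSE avoids naming every element after its code
prop_dmg_exp <- sapply(storm$PROPDMGEXP, FUN = exp_transform, USE.NAMES = FALSE)
storm$prop_dmg <- storm$PROPDMG * (10 ^ prop_dmg_exp)
crop_dmg_exp <- sapply(storm$CROPDMGEXP, FUN = exp_transform, USE.NAMES = FALSE)
storm$crop_dmg <- storm$CROPDMG * (10 ^ crop_dmg_exp)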
# Compute the economic loss by event type
library(plyr)
econ_loss <- ddply(storm, .(EVTYPE), summarize,
                   prop_dmg = sum(prop_dmg),
                   crop_dmg = sum(crop_dmg))

# filter out events that caused no economic loss
econ_loss <- econ_loss[(econ_loss$prop_dmg > 0 | econ_loss$crop_dmg > 0), ]
prop_dmg_events <- head(econ_loss[order(econ_loss$prop_dmg, decreasing = TRUE), ], 10)
crop_dmg_events <- head(econ_loss[order(econ_loss$crop_dmg, decreasing = TRUE), ], 10)
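
The `exp_transform` function above is applied one element at a time through `sapply`. As a hedged alternative, the same mapping can be sketched as a vectorized lookup. The names `exp_lookup` and `exp_vectorized` are hypothetical; the rules (letters map to 2, 3, 6 and 9; digits are taken as the power of ten; empty, `-`, `?` and `+` map to 0; anything else is an error) are assumed to mirror the intent of `exp_transform`.

```r
# Hypothetical vectorized variant of exp_transform
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

exp_vectorized <- function(codes) {
    # as.character() makes this safe for factor columns as well
    codes <- tolower(as.character(codes))
    is_letter <- codes %in% names(exp_lookup)
    is_digit  <- grepl("^[0-9]+$", codes)
    is_blank  <- codes %in% c("", "-", "?", "+")
    if (!all(is_letter | is_digit | is_blank)) {
        stop("Invalid exponent value.")
    }
    out <- rep(0, length(codes))  # blank/ambiguous codes keep exponent 0
    out[is_letter] <- unname(exp_lookup[codes[is_letter]])
    out[is_digit]  <- as.numeric(codes[is_digit])
    out
}

exp_vectorized(c("K", "m", "", "5", "B"))
# 3 6 0 5 9

# Possible use with the storm data frame (not run here):
# storm$prop_dmg <- storm$PROPDMG * 10 ^ exp_vectorized(storm$PROPDMGEXP)
```

Working on whole vectors avoids a function call per row, which matters for a table with several hundred thousand records.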

The 10 event types that caused the most property damage (in dollars) are:

prop_dmg_events[, c("EVTYPE", "prop_dmg")]
##                 EVTYPE     prop_dmg
## 132        flash flood 6.820237e+13
## 694 thunderstorm winds 2.086532e+13
## 737            tornado 1.078951e+12
## 203               hail 3.157558e+11
## 400          lightning 1.729433e+11
## 148              flood 1.446577e+11
## 361  hurricane typhoon 6.930584e+10
## 155           flooding 5.920826e+10
## 581        storm surge 4.332354e+10
## 264         heavy snow 1.793259e+10

Similarly, the 10 event types that caused the most crop damage (in dollars) are:

crop_dmg_events[, c("EVTYPE", "crop_dmg")]
##                EVTYPE    crop_dmg
## 77            drought 13972566000
## 148             flood  5661968450
## 515       river flood  5029459000
## 377         ice storm  5022113500
## 203              hail  3025974480
## 352         hurricane  2741910000
## 361 hurricane typhoon  2607872800
## 132       flash flood  1421317100
## 118      extreme cold  1312973000
## 179      frost freeze  1094186000

## Results

### Health impact of weather events

The following plot shows the weather event types with the most fatalities and the most injuries.

library(ggplot2)
library(gridExtra)
# Set the levels in order
p1 <- ggplot(data=fatal_events,
             aes(x=reorder(EVTYPE, fatalities), y=fatalities, fill=fatalities)) +
  geom_bar(stat="identity") +
  coord_flip() +
  ylab("Total number of fatalities") +
  xlab("Event type") +
  theme(legend.position="none") +
  ggtitle("Top deadly weather events in the US (1950-2011)")

p2 <- ggplot(data=injury_events,
             aes(x=reorder(EVTYPE, injuries), y=injuries, fill=injuries)) +
  geom_bar(stat="identity") +
  coord_flip() + 
  ylab("Total number of injuries") +
  xlab("Event type") +
  theme(legend.position="none") +
  ggtitle("Top injury-causing weather events in the US (1950-2011)")

grid.arrange(p1, p2)

Tornadoes cause the most deaths and injuries of all event types: more than 5,600 deaths and more than 91,000 injuries in the US between 1950 and 2011. The next most deadly event types are excessive heat and flash floods.

### Economic impact of weather events

The following plot shows the weather event types that have caused the greatest economic cost since 1950.

library(ggplot2)
library(gridExtra)
# Set the levels in order
p1 <- ggplot(data=prop_dmg_events,
             aes(x=reorder(EVTYPE, prop_dmg), y=log10(prop_dmg), fill=prop_dmg)) +
    geom_bar(stat="identity") +
    coord_flip() +
    xlab("Event type") +
    ylab("Property damage in USD (log-scale)") +
    theme(legend.position="none") +
    ggtitle("Weather costs to the US economy (1950-2011)")

p2 <- ggplot(data=crop_dmg_events,
             aes(x=reorder(EVTYPE, crop_dmg), y=crop_dmg, fill=crop_dmg)) +
    geom_bar(stat="identity") +
    coord_flip() + 
    xlab("Event type") +
    ylab("Crop damage in USD") + 
    theme(legend.position="none") +
    ggtitle("Weather costs to the US economy (1950-2011)")

grid.arrange(p1, p2)

## Conclusion

The results show that tornadoes had the greatest health impact, followed by excessive heat in terms of fatalities. Flash floods and thunderstorm winds caused billions of dollars in property damage between 1950 and 2011. The largest crop damage was caused by drought, followed by flood, river flood, ice storm and hail.

The data shows that flash floods and thunderstorm winds caused the largest property damage among weather-related natural disasters.

The most severe weather event type in terms of crop damage is drought, which has caused nearly 14 billion dollars in damage. Other event types with severe crop damage are floods and ice storms.