Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States from 1950 to 2011. The analysis answers the following two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The analysis identifies the event types that together account for over 80% of the health and economic impacts.

Regarding the population health impact, these are:

Tornado, Excessive heat, Tstm wind, Flood, Lightning

Regarding the economic impact:

Tornado, Flash flood, Tstm wind, Hail, Flood, Thunderstorm wind, Lightning

Data processing

The NOAA database is publicly available at the URL given in the code below. The data is a bzip2-compressed CSV file, which read.csv loads directly without manual decompression.

# Load libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)

# Constants
bzipFileName = "StormData.csv.bz2"
datFileUrl = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Check if the data file exists
if(!file.exists(bzipFileName)){
    download.file(datFileUrl,bzipFileName)
}
# Load data
dat <- read.csv(bzipFileName)
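The loading step relies on read.csv reading a bzip2-compressed file directly. The following is a minimal sketch of that behaviour, assuming a small made-up table (the toy rows and the temporary file are illustrative and are not part of the NOAA data):

```r
# Toy illustration with made-up rows (not NOAA data)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
                  FATALITIES = c(3, 1, 0))

# Write the toy table to a temporary bzip2-compressed CSV file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the compressed file without a separate decompression step
roundTrip <- read.csv(tmp)
print(roundTrip)

# Remove the temporary file
unlink(tmp)
```

The same call pattern is what loads StormData.csv.bz2 above; only the file is different.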

Q1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The impact on population health is measured by the total number of fatalities and injuries, represented in the data by the FATALITIES and INJURIES variables.
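As a small worked example of this measure, the sketch below sums fatalities and injuries per event type on a made-up table. The rows are hypothetical, and base R's aggregate is used here only to keep the example free of package dependencies; the analysis itself uses dplyr.

```r
# Hypothetical records (not NOAA data): one row per reported event
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(2, 1, 0, 0),
  INJURIES   = c(10, 3, 5, 1)
)

# Health impact of a single record: fatalities plus injuries
toy$healthImpact <- toy$FATALITIES + toy$INJURIES

# Total health impact per event type, largest first
byEvent <- aggregate(healthImpact ~ EVTYPE, data = toy, FUN = sum)
byEvent <- byEvent[order(byEvent$healthImpact, decreasing = TRUE), ]
print(byEvent)
```

In this toy table the two tornado records combine to 17, ahead of flood (4) and hail (1).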

Analysis

The answer to the question is a direct output of the code below. The analysis identifies which event types are responsible for at least 80% of all health impacts, plots them in a bar chart, and identifies the top event type.

# Group data by event types and arrange in descending order of health impact
dat <- group_by(dat, EVTYPE)
d1 <- summarize(dat, healthImpact=sum(FATALITIES + INJURIES))
d1 <- arrange(d1, desc(healthImpact))
# rearrange EVTYPE
d1$EVTYPE <- factor(d1$EVTYPE, levels = d1$EVTYPE[order(d1$healthImpact, decreasing=TRUE)])
# Total health impact (fatalities plus injuries) across all event types
totalHealthImpact <- sum(d1$healthImpact)
# Processing data for additional information
cumulativeHealthImpact <- 0
cumulativeIndex <- 0
i <- 1
# Count the event types that actually caused fatalities or injuries,
# and find how many of the top event types account for 80% of the health impacts
while ((d1[i+1,2]>0) & (i<nrow(d1)-1)) {
    cumulativeHealthImpact <- cumulativeHealthImpact+d1[i,2]
    if(cumulativeIndex==0 & cumulativeHealthImpact>= 0.8*totalHealthImpact){
        cumulativeIndex <- i
        cumulativeImpact <- cumulativeHealthImpact
    }
    i <- i+1
}
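The 80% cut-off found by the loop can also be sketched with a cumulative sum. This is one possible vectorized alternative and not the method used in this report; the totals below are hypothetical and assumed to be sorted in descending order already. The names impact, share, topN and nonZero are introduced only for this illustration.

```r
# Hypothetical per-event totals, sorted in descending order (not NOAA data)
impact <- c(TORNADO = 60, HEAT = 15, FLOOD = 10, LIGHTNING = 8, HAIL = 7, FOG = 0)

# Running share of the overall total after each event type
share <- cumsum(impact) / sum(impact)

# Smallest number of top event types whose combined share reaches 80%
topN <- unname(which(share >= 0.8)[1])

# Number of event types with any impact at all
nonZero <- sum(impact > 0)

print(names(impact)[seq_len(topN)])
print(share[topN])
```

With these toy numbers the first three event types reach the threshold, with a combined share of 0.85.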

Results

print.noquote(paste0("Of the ",nrow(d1)," event types, ",i," (",
                     round(i/nrow(d1)*100,2),"%) caused fatalities or injuries."))
## [1] Of the 985 event types, 220 (22.34%) caused fatalities or injuries.
print.noquote(paste0(round(cumulativeImpact/totalHealthImpact*100,2),
                     "% of health impacts are caused by the top ",
                     cumulativeIndex," events."))
## [1] 81.05% of health impacts are caused by the top 5 events.
# Now concentrate on the events causing 80% of the health impacts, plot
d1 <- head(d1, cumulativeIndex)

ggplot(data=d1, aes(x=EVTYPE, y=healthImpact)) +
    geom_bar(stat="identity") +
    labs(title="Number of health impacts for the top weather events") +
    labs(x="Event type") + labs(y="Health impacts")

# Information about the #1 event
print.noquote(paste0("The top event is ", d1[[1,1]]))
## [1] The top event is TORNADO
print.noquote(paste0("It alone is responsible for ",
                     round(d1[1,2]/totalHealthImpact*100,2),
                     "% of all fatalities and injuries."))
## [1] It alone is responsible for 62.3% of all fatalities and injuries.

Q2: Across the United States, which types of events have the greatest economic consequences?

The economic damage is measured by the damage caused to property and crops (variables PROPDMG + CROPDMG).

Analysis

The answer to the question is a direct output of the code below. The analysis identifies which event types are responsible for at least 80% of all economic impacts, plots them in a bar chart, and identifies the top event type.

# Arrange data in descending order of economic impact
d2 <- summarize(dat, econImpact=sum(PROPDMG + CROPDMG))
d2 <- arrange(d2, desc(econImpact))
# rearrange EVTYPE
d2$EVTYPE <- factor(d2$EVTYPE, levels = d2$EVTYPE[order(d2$econImpact, decreasing=TRUE)])
# Total economic impact (property plus crop damage) across all event types
totalEconImpact <- sum(d2$econImpact)
# Processing data for additional information
cumulativeEconImpact <- 0
cumulativeIndex <- 0
i <- 1
# Count the event types that actually caused property or crop damage,
# and find how many of the top event types account for 80% of the economic impacts
while ((d2[i+1,2]>0) & (i<nrow(d2)-1)) {
    cumulativeEconImpact <- cumulativeEconImpact+d2[i,2]
    if(cumulativeIndex==0 & cumulativeEconImpact>= 0.8*totalEconImpact){
        cumulativeIndex <- i
        cumulativeImpact <- cumulativeEconImpact
    }
    i <- i+1
}

Results

print.noquote(paste0("Of the ",nrow(d2)," event types, ",i," (",
                     round(i/nrow(d2)*100,2),"%) caused property or crop damages."))
## [1] Of the 985 event types, 431 (43.76%) caused property or crop damages.
print.noquote(paste0(round(cumulativeImpact/totalEconImpact*100,2),
                     "% of economic impacts are caused by the top ",
                     cumulativeIndex," events."))
## [1] 83.54% of economic impacts are caused by the top 7 events.
# Now concentrate on the events causing 80% of the economic impacts, plot
d2 <- head(d2, cumulativeIndex)

ggplot(data=d2, aes(x=EVTYPE, y=econImpact)) +
    geom_bar(stat="identity") +
    labs(title="Economic impact of the top weather events") +
    labs(x="Event type") + labs(y="Economic impact")

# Information about the #1 event
print.noquote(paste0("The top event is ", d2[[1,1]]))
## [1] The top event is TORNADO
print.noquote(paste0("It alone is responsible for ",
                     round(d2[1,2]/totalEconImpact*100,2),
                     "% of all property and crop damages."))
## [1] It alone is responsible for 27.01% of all property and crop damages.