Synopsis

In this report, we highlight the weather events that have the greatest impact on population health and the economy. The analysis uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database were recorded from 1950 to November 2011. Our analysis shows that the three event types most harmful to population health are tornado, excessive heat and TSTM WIND (thunderstorm wind). The three event types with the greatest economic consequences are flood, hurricane/typhoon and tornado.

Data Processing

We obtained the data for the analysis from the Reproducible Research course website on Coursera.

Documentation of how the database and data are constructed/defined can be found here:

  1. National Weather Service Storm Data Document

  2. National Climatic Data Center Storm Event FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database, there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Downloading and reading in the data

We first download the data. The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

stormDataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
bzip2temp <- tempfile()
download.file(stormDataUrl, bzip2temp, method="curl")
rawStormData <- read.csv(bzip2temp)
unlink(bzip2temp)
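
As a small check that a bzip2-compressed CSV can be read without an explicit decompression step, the sketch below writes a made-up two-row table (not the storm data) through a bzip2 connection and reads it back with read.csv. The names tmp, con and roundTrip are illustrative assumptions and do not appear in the analysis.

```r
## write a tiny, invented table to a bzip2-compressed CSV file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                     FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

## read.csv opens the path with file(), which detects the compression
roundTrip <- read.csv(tmp)
unlink(tmp)
```

The same mechanism is what lets the download above be passed straight to read.csv.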

After reading in the raw data, we check the dimensions of the dataset (there are 902297 rows) and the first few rows of its first 20 columns.

dim(rawStormData)
## [1] 902297     37
head(rawStormData[,1:20])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH
## 1         NA         0                      14.0   100
## 2         NA         0                       2.0   150
## 3         NA         0                       0.1   123
## 4         NA         0                       0.0   100
## 5         NA         0                       0.0   150
## 6         NA         0                       1.5   177

Results

Events which are most harmful with respect to population health

To analyse which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health, we focus on 3 columns in the dataset:

  1. EVTYPE which describes the weather events

  2. FATALITIES which summarises the number of fatalities attributed to the event

  3. INJURIES which summarises the number of injuries attributed to the event.

We start by summing the fatalities and injuries per event type, then add the two totals to get the total number of people harmed by each event type. We keep only the event types with at least one fatality or injury.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)
library(ggplot2)

## set some options: avoid scientific notation and widen the console output
options(scipen=999)
options(width=120)

healthEventSummary <- group_by(rawStormData, EVTYPE) %>%
  summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES)) %>%
  mutate(total_affected=total_fatalities+total_injuries) %>%
  filter(total_affected > 0) 
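
To make the aggregation concrete, here is a minimal sketch on an invented four-row data frame. It uses base R's aggregate rather than dplyr so that it runs without extra packages; the names toy and toySummary and all the values are assumptions for illustration only.

```r
## invented sample: two tornado records, one flood, one hail
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 0, 1, 0),
  INJURIES   = c(10, 3, 0, 0),
  stringsAsFactors = FALSE
)

## sum fatalities and injuries per event type
toySummary <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

## total people harmed, keeping only event types with at least one
toySummary$total_affected <- toySummary$FATALITIES + toySummary$INJURIES
toySummary <- toySummary[toySummary$total_affected > 0, ]
```

In this toy case HAIL is dropped because nobody was harmed, which mirrors the filter step in the pipeline above.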

We keep the event types whose total affected is in the top 10%, that is, at or above the 90th percentile across event types.

healthEventSummary <- subset(healthEventSummary, total_affected >= quantile(total_affected,0.9))

Number of rows in this subset is:

nrow(healthEventSummary)
## [1] 22
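
The cutoff is a quantile over event types, not over individual events. The following sketch shows the same filter on ten invented totals; toyTotals, cutoff and kept are hypothetical names, and quantile is used with its default interpolation method.

```r
## ten invented event types with totals 1 to 10
toyTotals <- data.frame(EVTYPE = paste0("EVENT_", 1:10),
                        total_affected = 1:10,
                        stringsAsFactors = FALSE)

## 90th percentile of the totals (default method interpolates to 9.1)
cutoff <- quantile(toyTotals$total_affected, 0.9)

## keep only the event types at or above the cutoff
kept <- toyTotals[toyTotals$total_affected >= cutoff, ]
```

Here only the largest of the ten totals survives, just as 22 event types survive in the storm data.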

To see the most harmful event types, we sort the subset by total affected in descending order and display the first few records.

head(arrange(healthEventSummary, desc(total_affected)))
## Source: local data frame [6 x 4]
## 
##           EVTYPE total_fatalities total_injuries total_affected
## 1        TORNADO             5633          91346          96979
## 2 EXCESSIVE HEAT             1903           6525           8428
## 3      TSTM WIND              504           6957           7461
## 4          FLOOD              470           6789           7259
## 5      LIGHTNING              816           5230           6046
## 6           HEAT              937           2100           3037

To get a better view of the top event types, we plot a stacked bar graph.

## remove the total_affected column so that only fatalities and injuries are stacked
healthEventSummary<- select(healthEventSummary, -total_affected)
## reshape the data into a long format
meltedTop10HealthEvents <- melt(healthEventSummary, id.vars="EVTYPE")
## draw the plot
ggplot(meltedTop10HealthEvents, aes(x=EVTYPE, y=value, fill=variable)) + 
geom_bar(stat='identity') + 
scale_fill_discrete(name="Harm to population",
                    breaks=c("total_fatalities","total_injuries"),
                    labels=c("Fatalities","Injuries")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
labs(y="Number of people affected") +
labs(x="Storm Event Types") +
labs(title="Storm event types with highest harm to population")

From the plot above, it is very clear that tornadoes cause the most harm to the population.

Events that have the greatest economic consequences

To analyse which types of events (as indicated in the EVTYPE variable) have the greatest economic consequences, we focus on the following columns in the dataset:

  1. EVTYPE which describes the weather events

  2. PROPDMG which gives the numeric part of the property damage estimate in dollars, attributed to the event; it must be scaled by PROPDMGEXP. Estimates are rounded to three significant digits.

  3. PROPDMGEXP which indicates the magnitude of the number stored in PROPDMG.

  4. CROPDMG which gives the numeric part of the crop damage estimate in dollars, attributed to the event; it must be scaled by CROPDMGEXP.

  5. CROPDMGEXP which indicates the magnitude of the number stored in CROPDMG.

From the raw data read at the beginning, we create a subset containing only the records where the PROPDMG or CROPDMG value is greater than zero, and only the columns we are interested in. This cuts down the number of rows and columns for processing.

econSummary <- filter(rawStormData, PROPDMG > 0 | CROPDMG > 0) %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

To compare dollar values, we create a column that stores the multiplier implied by the magnitude code and use it to work out the actual damage value.

## column for storing the multiplier for property damage; assume 1 as the default multiplier
econSummary$propdamageExp <- 1
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='H'] <- 100
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='K'] <- 1000
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='M'] <- 1000000
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='B'] <- 1000000000
econSummary$propertyDamage <- econSummary$PROPDMG*econSummary$propdamageExp
## column for storing the multiplier for crop damage; assume 1 as the default multiplier
econSummary$cropdamageExp <- 1
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='H'] <- 100
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='K'] <- 1000
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='M'] <- 1000000
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='B'] <- 1000000000
econSummary$cropDamage <- econSummary$CROPDMG*econSummary$cropdamageExp
## add both the property damage and crop damage for the event
econSummary$totalDamage <- econSummary$propertyDamage + econSummary$cropDamage
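
The same mapping can also be written as a lookup table. The sketch below is an alternative formulation, not the code used in this analysis; the helper multiplierFor and the toyDamage values are hypothetical. As in the code above, any code other than H, K, M or B (in either case) falls back to a multiplier of 1.

```r
## hypothetical helper: map a magnitude code to its numeric multiplier
multiplierFor <- function(exp) {
  lookup <- c(H = 100, K = 1000, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(as.character(exp))])
  m[is.na(m)] <- 1   # unknown or empty codes default to 1
  m
}

## invented sample values to show the scaling
toyDamage <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 5),
                        PROPDMGEXP = c("K", "m", "B", ""),
                        stringsAsFactors = FALSE)
toyDamage$propertyDamage <- toyDamage$PROPDMG *
  multiplierFor(toyDamage$PROPDMGEXP)
```

For example, a PROPDMG of 25 with code K becomes 25,000 dollars, and 2.5 with code m becomes 2,500,000 dollars.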

To find the event types that have resulted in the greatest economic consequences, we sum the total damage per event type.

econTotals <- group_by(econSummary, EVTYPE) %>%
  summarise(eventTotalDamage=sum(totalDamage))

We keep the event types whose total damage value is in the top 10%, that is, at or above the 90th percentile across event types.

top10 <- subset(econTotals, eventTotalDamage >= quantile(eventTotalDamage,0.9))

Number of rows in this subset is:

nrow(top10)
## [1] 44

To see the event types that cause the greatest economic consequences, we sort the records in the top 10% in descending order of total damage and display the first few records.

head(arrange(top10, desc(eventTotalDamage)))
## Source: local data frame [6 x 2]
## 
##              EVTYPE eventTotalDamage
## 1             FLOOD     150319678257
## 2 HURRICANE/TYPHOON      71913712800
## 3           TORNADO      57352114049
## 4       STORM SURGE      43323541000
## 5              HAIL      18758222016
## 6       FLASH FLOOD      17562129167

We can plot the storm event types with the highest economic consequences.

library(scales)
ggplot(top10, aes(x=EVTYPE, y=eventTotalDamage, fill=eventTotalDamage)) + 
geom_bar(stat='identity') + 
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
labs(y="$ value of damage") +
labs(x="Storm Event Types") +
labs(title="Storm event types with highest economic consequences") +
scale_y_continuous(labels=comma) +
scale_fill_continuous(name="Economic Value",
                    labels=comma)