In this report, we aim to identify the weather events that have the greatest impact on population health and the economy. The analysis uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which covers events from 1950 to November 2011. Our analysis shows that the three event types with the greatest impact on population health are tornado, excessive heat and TSTM WIND (thunderstorm wind). The three event types with the greatest economic consequences are flood, hurricane/typhoon and tornado.
We obtained the data for the analysis from the Reproducible Research Coursera website.
Documentation of how the database and data are constructed/defined can be found here:
National Weather Service Storm Data Document
National Climatic Data Center Storm Event FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database, there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We first download the data. The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
stormDataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
bzip2temp <- tempfile()
download.file(stormDataUrl, bzip2temp, method="curl")
rawStormData <- read.csv(bzip2temp)
unlink(bzip2temp)
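The call to read.csv above can take the compressed file directly because R's file connections detect bzip2 compression when a file is opened for reading. A minimal sketch of that behaviour on a toy data frame (the values and the names toy, tmp and roundTrip are made up for illustration and are not taken from the storm data):

```r
## Toy data; invented values, for illustration only
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

## write the toy data as a bzip2-compressed CSV file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

## read.csv is given only the file name; the compression is detected on opening
roundTrip <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)
roundTrip
```

No explicit decompression step is needed, which is why the download code passes the temporary file name straight to read.csv.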
After reading in the raw data, we check its dimensions (there are 902297 rows and 37 columns) and look at the first few rows, showing only the first 20 columns.
dim(rawStormData)
## [1] 902297 37
head(rawStormData[,1:20])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH
## 1 NA 0 14.0 100
## 2 NA 0 2.0 150
## 3 NA 0 0.1 123
## 4 NA 0 0.0 100
## 5 NA 0 0.0 150
## 6 NA 0 1.5 177
To analyse which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health, we focus on 3 columns in the dataset:
EVTYPE which describes the weather events
FATALITIES which summarises the number of fatalities attributed to the event
INJURIES which summarises the number of injuries attributed to the event
We start by summing the fatalities and injuries per event type, then add these two totals together to give the total number of people harmed by each event type. We keep only those event types with at least one fatality or injury.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(ggplot2)
## some options setting
options(scipen=999)
options(width=120)
healthEventSummary <- group_by(rawStormData, EVTYPE) %>%
summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES)) %>%
mutate(total_affected=total_fatalities+total_injuries) %>%
filter(total_affected > 0)
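To make the grouping step easy to check by hand, here is a small sketch of the same calculation on a toy data frame. It uses base R's aggregate rather than dplyr so that it runs without extra packages. The records are invented for illustration and are not taken from the storm data.

```r
## Toy records; the values are invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "FLOOD", "FLOOD"),
  FATALITIES = c(1, 2, 0, 0, 1),
  INJURIES   = c(10, 5, 0, 3, 0),
  stringsAsFactors = FALSE
)

## sum fatalities and injuries within each event type
toySummary <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

## total number of people harmed per event type
toySummary$total_affected <- toySummary$FATALITIES + toySummary$INJURIES

## drop event types that harmed nobody (HAIL in this toy data)
toySummary <- toySummary[toySummary$total_affected > 0, ]
toySummary
```

In this toy data, TORNADO ends up with 18 people affected and FLOOD with 4, while HAIL is filtered out.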
We keep the event types whose total affected count is at or above the 90th percentile, that is, the top 10% of event types.
healthEventSummary <- subset(healthEventSummary, total_affected >= quantile(total_affected,0.9))
The number of rows in this subset is:
nrow(healthEventSummary)
## [1] 22
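The percentile cut-off can be illustrated on a toy vector. This is only a sketch: the numbers 1 to 20 stand in for per-event totals and are not from the storm data, and quantile is called with its default interpolation method, as in the analysis code.

```r
## Toy per-event totals; invented values, for illustration only
totals <- 1:20

## 90th percentile of the totals (default quantile type)
threshold <- quantile(totals, 0.9)

## keep the values at or above the threshold, i.e. the top 10%
kept <- totals[totals >= threshold]
kept
```

Because the threshold in the analysis is computed after event types with no fatalities or injuries have been removed, the 22 rows are the top 10% of the harmful event types only, not of all event types in the database.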
To see the most harmful event types, we sort the subset by total affected in descending order and display the top few records.
head(arrange(healthEventSummary, desc(total_affected)))
## Source: local data frame [6 x 4]
##
## EVTYPE total_fatalities total_injuries total_affected
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
To compare the top event types more easily, we plot a stacked bar graph.
## remove the total_affected column so that only fatalities and injuries are reshaped and stacked
healthEventSummary<- select(healthEventSummary, -total_affected)
## reshape the data into a long format
meltedTop10HealthEvents <- melt(healthEventSummary, id.vars="EVTYPE")
## draw the plot
ggplot(meltedTop10HealthEvents, aes(x=EVTYPE, y=value, fill=variable)) +
geom_bar(stat='identity') +
scale_fill_discrete(name="Harm to population",
breaks=c("total_fatalities","total_injuries"),
labels=c("Fatalities","Injuries")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(y="Number of people affected") +
labs(x="Storm Event Types") +
labs(title="Storm event types with highest harm to population")
The plot above shows clearly that tornadoes cause by far the most harm to the population: 96979 people affected, against 8428 for excessive heat, the next event type.
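The reshaping done before plotting turns one row per event type into one row per event type and measure, which is the layout a stacked bar graph needs. A base R sketch of that wide-to-long step, using two rows copied from the table above (this is a stand-in for the melt call that should give the same layout, not the reshape2 implementation itself):

```r
## Two rows copied from the summary table above
wide <- data.frame(
  EVTYPE           = c("TORNADO", "EXCESSIVE HEAT"),
  total_fatalities = c(5633, 1903),
  total_injuries   = c(91346, 6525),
  stringsAsFactors = FALSE
)

## long format: one row per (event type, measure) pair
long <- data.frame(
  EVTYPE   = rep(wide$EVTYPE, times = 2),
  variable = rep(c("total_fatalities", "total_injuries"), each = nrow(wide)),
  value    = c(wide$total_fatalities, wide$total_injuries),
  stringsAsFactors = FALSE
)
long
```

Each event type now appears twice, once per measure, so the fill colour can be mapped to the variable column and the bar height to the value column.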
To analyse which types of events (as indicated in the EVTYPE variable) have the greatest economic consequences, we focus on the following columns in the dataset:
EVTYPE which describes the weather events
PROPDMG which gives the numeric part of the property damage estimate attributed to the event. Estimates are rounded to three significant digits.
PROPDMGEXP which indicates the magnitude of the number stored in PROPDMG (for example K for thousands, M for millions and B for billions).
CROPDMG which gives the numeric part of the crop damage estimate attributed to the event.
CROPDMGEXP which indicates the magnitude of the number stored in CROPDMG.
From the raw data read at the beginning, we create a subset containing only the records where PROPDMG or CROPDMG is greater than zero and only the columns we are interested in. This cuts down the number of rows and columns to process.
econSummary <- filter(rawStormData, PROPDMG> 0 | CROPDMG > 0) %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
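To compare dollar values, we create a column holding the multiplier implied by each magnitude code (H for hundreds, K for thousands, M for millions, B for billions; any other code is treated as a multiplier of 1) and use it to work out the actual damage value.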
## column for storing the multiplier for property damage; assume 1 as the default multiplier
econSummary$propdamageExp <- 1
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='H'] <- 100
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='K'] <- 1000
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='M'] <- 1000000
econSummary$propdamageExp[toupper(econSummary$PROPDMGEXP)=='B'] <- 1000000000
econSummary$propertyDamage <- econSummary$PROPDMG*econSummary$propdamageExp
## column for storing the multiplier for crop damage; assume 1 as the default multiplier
econSummary$cropdamageExp <- 1
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='H'] <- 100
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='K'] <- 1000
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='M'] <- 1000000
econSummary$cropdamageExp[toupper(econSummary$CROPDMGEXP)=='B'] <- 1000000000
econSummary$cropDamage <- econSummary$CROPDMG*econSummary$cropdamageExp
## add both the property damage and crop damage for the event
econSummary$totalDamage <- econSummary$propertyDamage + econSummary$cropDamage
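The same mapping from magnitude code to multiplier can also be written once as a lookup vector and reused for both the property and crop columns. The sketch below is one possible alternative, not the code used in this analysis. The names magnitude and toMultiplier are hypothetical, and the sample codes and amounts are invented for illustration.

```r
## multiplier for each recognised magnitude code
magnitude <- c(H = 100, K = 1e3, M = 1e6, B = 1e9)

## hypothetical helper: map codes to multipliers, defaulting to 1
toMultiplier <- function(expCode) {
  mult <- unname(magnitude[toupper(as.character(expCode))])
  mult[is.na(mult)] <- 1   # blank or unrecognised codes
  mult
}

## invented sample values, for illustration only
sampleDamage <- c(25, 2.5, 10)
sampleExp    <- c("K", "m", "")
sampleDamage * toMultiplier(sampleExp)
```

With this helper, the property damage column could be computed as PROPDMG * toMultiplier(PROPDMGEXP), and the crop damage column in the same way. It follows the same rules as the code above: codes are matched case-insensitively and anything other than H, K, M or B gets a multiplier of 1.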
To find the event types that have resulted in the greatest economic consequences, we sum the total damage per event type.
econTotals <- group_by(econSummary, EVTYPE) %>%
summarise(eventTotalDamage=sum(totalDamage))
We keep the event types whose total damage value is at or above the 90th percentile, that is, the top 10% of event types.
top10 <- subset(econTotals, eventTotalDamage >= quantile(eventTotalDamage,0.9))
The number of rows in this subset is:
nrow(top10)
## [1] 44
To see the event types that cause the greatest economic consequences, we sort the records in this top 10% subset in descending order of total damage and display the top few records.
head(arrange(top10, desc(eventTotalDamage)))
## Source: local data frame [6 x 2]
##
## EVTYPE eventTotalDamage
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352114049
## 4 STORM SURGE 43323541000
## 5 HAIL 18758222016
## 6 FLASH FLOOD 17562129167
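The raw dollar totals are hard to read at a glance, so it can help to express them in billions. The short sketch below copies the six totals from the output above and rescales them. The names topDamage and damageBillions are introduced here for illustration only.

```r
## totals copied from the output above (in dollars)
topDamage <- c(
  "FLOOD"             = 150319678257,
  "HURRICANE/TYPHOON" = 71913712800,
  "TORNADO"           = 57352114049,
  "STORM SURGE"       = 43323541000,
  "HAIL"              = 18758222016,
  "FLASH FLOOD"       = 17562129167
)

## express the totals in billions of dollars, rounded to two decimals
damageBillions <- round(topDamage / 1e9, 2)
damageBillions

## how many times larger flood damage is than hurricane/typhoon damage
floodRatio <- unname(topDamage["FLOOD"] / topDamage["HURRICANE/TYPHOON"])
floodRatio
```

On these figures, flood damage is roughly 150 billion dollars, about twice the total recorded for hurricane/typhoon.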
We can plot the storm event types with the highest economic consequences.
library(scales)
ggplot(top10, aes(x=EVTYPE, y=eventTotalDamage, fill=eventTotalDamage)) +
geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(y="$ value of damage") +
labs(x="Storm Event Types") +
labs(title="Storm event types with highest economic consequences") +
scale_y_continuous(labels=comma) +
scale_fill_continuous(name="Economic Value",
labels=comma)