Arash Tavassoli | September 4, 2018
This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database for the years 1950 to 2011 to identify the most harmful event types in terms of public health (fatalities and injuries) and economic impact. The analysis finds that tornadoes were by far the most harmful events in that period, causing the most fatalities and the most injuries, while floods caused the greatest economic loss (property damage and crop damage combined).
This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file is downloaded from the course website (if not downloaded already):
if(!file.exists("repdata-data-StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "repdata-data-StormData.csv.bz2", method = "curl")
}
Before proceeding, the required package is loaded into R:
library(dplyr)
The data are then loaded into R using the read.csv() function, which reads the compressed file directly:
raw.data <- read.csv("repdata-data-StormData.csv.bz2")
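No separate decompression step is needed because R's file connections detect bzip2 compression when reading. A minimal sketch of that behaviour on a made-up two-row file (the toy values and the temporary path are assumptions for illustration, not part of the storm data):

```r
# Write a tiny, made-up CSV through a bzip2 connection (illustrative data only)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv() is given the compressed path, exactly as in the report
toy <- read.csv(tmp)
str(toy)
```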
The names() function is called to see the variables in the dataset:
names(raw.data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Out of the 37 variables in the dataset, only the following 7 are of interest for the current assignment:
| Variable name | Description |
|---|---|
| EVTYPE | Event Type |
| FATALITIES | Number of fatalities |
| INJURIES | Number of injuries |
| PROPDMG | Estimated property damage |
| PROPDMGEXP | Character signifying the magnitude of the number (K, M, B) |
| CROPDMG | Estimated crop damage |
| CROPDMGEXP | Character signifying the magnitude of the number (K, M, B) |
Therefore the dataset is filtered to include only the 7 variables. This assists with further analysis of the data:
data <- select(raw.data, EVTYPE, FATALITIES:CROPDMGEXP)
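The range `FATALITIES:CROPDMGEXP` works because those six variables sit next to each other in the dataset. As a rough check, the sketch below recomputes the selected columns from the names printed above using base R only (the helper variables `storm.names`, `first`, `last` and `kept` are illustrative and not part of the original analysis):

```r
# Column names copied from the names(raw.data) output shown earlier
storm.names <- c("STATE__", "BGN_DATE", "BGN_TIME", "TIME_ZONE", "COUNTY",
                 "COUNTYNAME", "STATE", "EVTYPE", "BGN_RANGE", "BGN_AZI",
                 "BGN_LOCATI", "END_DATE", "END_TIME", "COUNTY_END", "COUNTYENDN",
                 "END_RANGE", "END_AZI", "END_LOCATI", "LENGTH", "WIDTH",
                 "F", "MAG", "FATALITIES", "INJURIES", "PROPDMG",
                 "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "WFO", "STATEOFFIC",
                 "ZONENAMES", "LATITUDE", "LONGITUDE", "LATITUDE_E", "LONGITUDE_",
                 "REMARKS", "REFNUM")

# Positions of the first and last column in the range
first <- match("FATALITIES", storm.names)
last  <- match("CROPDMGEXP", storm.names)

# EVTYPE plus every column between the two positions, inclusive
kept <- storm.names[c(match("EVTYPE", storm.names), first:last)]
kept
```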
The next step is to compute the actual values of the estimated property and crop damage using the magnitude indices (“K” for thousands, “M” for millions, and “B” for billions). This is done by creating two new variables, PROPtotDMG and CROPtotDMG, for the property and crop damage, respectively.
First a summary table of the current values for the magnitude index variables PROPDMGEXP and CROPDMGEXP is generated:
table(data$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
table(data$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
It is assumed that any index other than “K”, “M” or “B” (in upper or lower case) corresponds to a multiplier of 1, i.e. it is treated like a blank value.
Using a for() loop, the indices are converted to the corresponding multipliers:
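# Magnitude codes and their multipliers, matched position by position.
# The code "1" is not listed: as.numeric() below already converts it to 1.
exp <- c("", "-", "?", "+", 0, 2, 3, 4, 5, 6, 7, 8, "h", "H", "k", "K", "m", "M", "B")
num.exp <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10^3, 10^3, 10^6, 10^6, 10^9)
# Work on character vectors so that codes can be compared and overwritten
data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
for(i in seq_along(exp)) {
  data$PROPDMGEXP[data$PROPDMGEXP == exp[i]] <- num.exp[i]
  data$CROPDMGEXP[data$CROPDMGEXP == exp[i]] <- num.exp[i]
}
# Convert the replaced strings (e.g. "1000", "1e+06") to numbers
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)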
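The same mapping can also be expressed without a loop. The sketch below is one possible alternative, not the method used in this report: it looks each code up in a named vector and falls back to 1 for anything that is not K, M or B. The names `multiplier` and `to_multiplier` are hypothetical and introduced only for illustration.

```r
# Named lookup: only K/M/B (in either case) scale the value
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map magnitude codes to numeric multipliers
to_multiplier <- function(code) {
  m <- multiplier[toupper(as.character(code))]  # NA for codes not in the lookup
  m[is.na(m)] <- 1                              # blanks, symbols, digits, h/H
  unname(m)
}

to_multiplier(c("K", "m", "B", "", "?", "5", "h"))
```

Applied to the real columns, this would look like `data$PROPDMGEXP <- to_multiplier(data$PROPDMGEXP)`, under the same assumption that unrecognised codes mean a multiplier of 1.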
An updated summary table of the values for the magnitude index variables PROPDMGEXP and CROPDMGEXP is generated to ensure all indices are correctly converted:
table(data$PROPDMGEXP)
##
## 1 1000 1e+06 1e+09
## 466255 424665 11337 40
table(data$CROPDMGEXP)
##
## 1 1000 1e+06 1e+09
## 618440 281853 1995 9
The last processing step is to multiply the estimated property and crop damage by the corresponding multipliers and save the resulting values in the new PROPtotDMG and CROPtotDMG variables:
data$PROPtotDMG <- data$PROPDMG * data$PROPDMGEXP
data$CROPtotDMG <- data$CROPDMG * data$CROPDMGEXP
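As a small worked example of this multiplication, the sketch below uses three made-up records (the values are assumptions chosen for illustration, not rows taken from the database): a damage figure of 25 with a multiplier of 1000 becomes 25,000 dollars, and 1.2 with a multiplier of one billion becomes 1.2 billion dollars.

```r
# Hypothetical records: reported damage figure and its converted multiplier
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2),
                  PROPDMGEXP = c(1e3, 1e3, 1e9))

# Element-wise product gives the damage in dollars
toy$PROPtotDMG <- toy$PROPDMG * toy$PROPDMGEXP
toy
```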
The PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP variables are no longer required and can be removed from the dataset:
data$PROPDMG <- NULL
data$PROPDMGEXP <- NULL
data$CROPDMG <- NULL
data$CROPDMGEXP <- NULL
A quick look at the dataset:
head(data)
## EVTYPE FATALITIES INJURIES PROPtotDMG CROPtotDMG
## 1 TORNADO 0 15 25000 0
## 2 TORNADO 0 0 2500 0
## 3 TORNADO 0 2 25000 0
## 4 TORNADO 0 2 2500 0
## 5 TORNADO 0 2 2500 0
## 6 TORNADO 0 6 2500 0
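The key focus of the current project is to answer two questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences?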
To be able to answer these questions, the tidy dataset from the previous section is grouped by the event type variable EVTYPE:
event.data <- group_by(data, EVTYPE)
The top 10 event types based on “total fatalities”, “total injuries” and “total economic loss” are identified using a combination of arrange() and summarize() functions:
# Top 10 event types with maximum number of total fatalities:
top.fatal <- arrange(summarize(event.data, sum = sum(FATALITIES)), desc(sum))[1:10,]
# Top 10 event types with maximum number of total injuries:
top.injur <- arrange(summarize(event.data, sum = sum(INJURIES)), desc(sum))[1:10,]
# Top 10 event types with maximum economic loss (property and crop damage in total):
top.eco <- arrange(summarize(event.data, sum = sum(PROPtotDMG + CROPtotDMG)), desc(sum))[1:10,]
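To make the group-sum-rank logic easier to follow, here is a base R sketch of the same idea on a small made-up data frame (the toy rows are assumptions for illustration only; the report itself uses the dplyr calls above):

```r
# Made-up events, only to illustrate the aggregation steps
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(3, 1, 2, 0, 2)
)

# Step 1: total fatalities per event type (the summarize() step)
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Step 2: sort by the total, largest first (the arrange(desc()) step)
ranked <- totals[order(totals$FATALITIES, decreasing = TRUE), ]

# Step 3: keep the top rows (the [1:10, ] step, here with n = 2)
head(ranked, 2)
```

One difference worth noting: `head(x, n)` returns at most the rows that exist, whereas indexing with `[1:10, ]` pads the result with NA rows when there are fewer than 10 groups. With the full storm data this does not matter, since there are far more than 10 event types.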
Finally, bar plots are constructed using the base plotting system to show the top 10 most harmful event types with respect to population health (fatalities and injuries on separate plots) and the top 10 event types with the greatest economic consequences:
par(mar = c(8,6,4,2), mgp = c(4,1,0))
barplot(top.fatal$sum, names.arg = top.fatal$EVTYPE,
main = "The most harmful events with respect to population health (total fatalities)",
ylab = "Total fatalities",
cex.main = 0.9, cex.axis = .8, cex.names=0.7,
col = "blue", las=2)
par(mar = c(8,6,4,2), mgp = c(4,1,0))
barplot(top.injur$sum, names.arg = top.injur$EVTYPE,
main = "The most harmful events with respect to population health (total injuries)",
ylab = "Total injuries",
cex.main = 0.9, cex.axis = .8, cex.names=0.7,
col = "red", las=2)
Tornadoes are found to be the most harmful event type with respect to population health, in terms of both fatalities and injuries.
par(mar = c(8,6,4,2), mgp = c(4,1,0))
barplot(top.eco$sum, names.arg = top.eco$EVTYPE,
main = "The most harmful events with the greatest economic consequences",
ylab = "Total damages in $",
cex.main = 0.9, cex.axis = .8, cex.names=0.7,
col = "orange", las=2)
Flooding is shown to have the greatest economic impact in terms of combined damage to property and crops.