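This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We are interested in two questions: first, which types of events are most harmful, as measured by fatalities and injuries; and second, which types of events have the greatest economic consequences.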
The analysis shows that the answer to the first question is tornado and the answer to the second question is flood.
At the end of this document, the top 10 events related to each question are listed for further information.
Read the NOAA storm data into a data frame and cache the result.
noaa <- read.csv("repdata-data-StormData.csv.bz2")
head(noaa, 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
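The caching mentioned above is presumably handled by the report's chunk options, which are not shown here. As a minimal sketch of the same idea in plain R, assuming a hypothetical helper `read_cached()` and a hypothetical cache file name, the parsed data frame can be saved once with `saveRDS()` and reloaded on later runs:

```r
# Hypothetical helper (not part of the original analysis): parse the CSV once,
# store the resulting data frame as an .rds file, and reuse it afterwards.
read_cached <- function(csv_path, cache_path = paste0(csv_path, ".rds")) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))  # fast path: reuse the cached data frame
  }
  df <- read.csv(csv_path)       # slow path: parse the (possibly compressed) CSV
  saveRDS(df, cache_path)
  df
}

# Small demonstration on a temporary file rather than the real storm data.
demo_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 0)),
          demo_csv, row.names = FALSE)
first_read  <- read_cached(demo_csv)  # parses the CSV and writes the cache
second_read <- read_cached(demo_csv)  # served from the cache file
```

For the storm data, the call would be `read_cached("repdata-data-StormData.csv.bz2")`; the cache file has to be deleted by hand if the source file changes.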
What are the column names of the data?
names(noaa)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The first question asks which types of events are most harmful. After looking at the names of the data columns, I think “FATALITIES” and “INJURIES” relate to the harmfulness of an event.
Explore the values of “FATALITIES” and “INJURIES”.
fatalities <- noaa$FATALITIES
injuries <- noaa$INJURIES
sum(fatalities == 0)
## [1] 895323
sum(injuries == 0)
## [1] 884693
There are many 0 values in the two columns. Subset the data set to the rows where “FATALITIES” or “INJURIES” is non-zero.
data1 = subset(noaa, !((FATALITIES == 0) & (INJURIES == 0)))
dim(data1)
## [1] 21929 37
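The condition `!((FATALITIES == 0) & (INJURIES == 0))` keeps a row when at least one of the two counts is non-zero. A small sketch on made-up rows (the `toy` data frame is an assumption for illustration, not storm data) shows that it selects the same rows as the more direct `FATALITIES > 0 | INJURIES > 0`, provided the counts are never negative or missing:

```r
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(0, 0, 2, 1),
  INJURIES   = c(15, 0, 0, 4),
  stringsAsFactors = FALSE
)

# Condition as written in the analysis.
kept_original <- subset(toy, !((FATALITIES == 0) & (INJURIES == 0)))

# Equivalent form by De Morgan's law (for non-negative, non-missing counts).
kept_direct <- subset(toy, FATALITIES > 0 | INJURIES > 0)

# Only the HAIL row, with zero fatalities and zero injuries, is dropped.
print(kept_original$EVTYPE)
```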
Create a column called “harm” that is the total of fatalities and injuries for each event. Keep only the event type and harm columns for answering question 1.
fatalities <- as.integer(data1$FATALITIES)
injuries <- as.integer(data1$INJURIES)
harm <- fatalities + injuries
data1 <- cbind(data1, as.data.frame(harm))
data1 <- data1[, c("EVTYPE", "harm")]
head(data1)
## EVTYPE harm
## 1 TORNADO 15
## 3 TORNADO 2
## 4 TORNADO 2
## 5 TORNADO 2
## 6 TORNADO 6
## 7 TORNADO 1
Group the data by EVTYPE and compute the sum of harm for each type of event.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
harm_by_event <- group_by(data1, EVTYPE)
data1 <- summarize(harm_by_event, totalharm = sum(harm))
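To make the grouping step concrete, the following sketch applies the same group-and-sum idea to a few made-up rows. It uses base R's `aggregate()` instead of dplyr so that it runs without extra packages; the `toy` values are illustrative only.

```r
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  harm   = c(15, 2, 6, 4, 1),
  stringsAsFactors = FALSE
)

# One row per event type, with harm summed within each type.
totals <- aggregate(harm ~ EVTYPE, data = toy, FUN = sum)
names(totals)[2] <- "totalharm"
print(totals)
```

The result has one row for each distinct `EVTYPE`, which is the same shape that `group_by()` followed by `summarize()` produces.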
The second question asks which types of events have the greatest economic consequences. After looking at the names of the data columns, I think “PROPDMG” and “CROPDMG” are related to economic consequences. Similarly, I subset the data set to the rows where “PROPDMG” or “CROPDMG” is non-zero.
data2 = subset(noaa, !((PROPDMG == 0) & (CROPDMG == 0)))
dim(data2)
## [1] 245031 37
According to the code book provided by the project webpage, damage estimates are rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, e.g., 1.55B for $1,550,000,000. I first want to examine the magnitude characters in the columns “PROPDMGEXP” and “CROPDMGEXP”.
propexp <- data2$PROPDMGEXP
cropexp <- data2$CROPDMGEXP
table(propexp)
## propexp
## - ? + 0 1 2 3 4 5
## 4357 1 0 5 209 0 1 1 4 18
## 6 7 8 B h H K m M
## 3 2 0 40 1 6 229057 7 11319
table(cropexp)
## cropexp
## ? 0 2 B k K m M
## 145037 6 17 0 7 21 97960 1 1982
Many of the magnitude characters are not recognizable (for example “-”, “?”, “+”, and single digits). For this project, I will only analyze the damage values that have magnitude symbols in {B, h, H, K, m, M}.
mag_symbols = c("B", "h", "H", "K", "m", "M")
data2 <- subset(data2, PROPDMGEXP %in% mag_symbols & CROPDMGEXP %in% mag_symbols)
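One consequence of combining the two conditions with `&` is that a record is kept only when both exponent columns carry a recognized symbol. The sketch below uses made-up rows to show the effect; the looser `|` filter at the end is an assumption offered for comparison, not what this analysis does.

```r
mag_symbols <- c("B", "h", "H", "K", "m", "M")

toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 10, 0),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 5, 3),
  CROPDMGEXP = c("", "K", "K"),
  stringsAsFactors = FALSE
)

# Filter used in the analysis: both exponents must be recognized.
both_valid <- subset(toy, PROPDMGEXP %in% mag_symbols & CROPDMGEXP %in% mag_symbols)

# Looser alternative (assumption): keep a row if either exponent is recognized.
either_valid <- subset(toy, PROPDMGEXP %in% mag_symbols | CROPDMGEXP %in% mag_symbols)

print(both_valid$EVTYPE)   # only the FLOOD row has symbols in both columns
print(nrow(either_valid))  # every toy row has at least one recognized symbol
```

With the looser filter, the amount paired with an unrecognized exponent would still need its own rule, such as counting it as zero.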
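Multiply the values of “PROPDMG” and “CROPDMG” by the corresponding magnitudes, and divide by 1000000 so that the results are expressed in millions of dollars. Note that as.integer() truncates any fractional part of the recorded damage figures (for example, 2.5 becomes 2).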
map = setNames(c(1000000000, 100, 100, 1000, 1000000, 1000000), c("B", "h", "H", "K", "m", "M"))
propexp <- as.vector(data2$PROPDMGEXP)
cropexp <- as.vector(data2$CROPDMGEXP)
propmag <- as.integer(map[unlist(propexp)])
cropmag <- as.integer(map[unlist(cropexp)])
propval = as.integer(data2$PROPDMG) * (propmag / 1000000)
cropval = as.integer(data2$CROPDMG) * (cropmag / 1000000)
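The lookup relies on indexing a named vector by a character vector, which returns one multiplier per record. A minimal sketch with made-up values follows; it multiplies the raw amounts without converting them to integers, so fractional amounts such as 2.5 are preserved. The `amount` and `symbol` vectors are illustrative assumptions.

```r
multiplier <- c(B = 1e9, h = 1e2, H = 1e2, K = 1e3, m = 1e6, M = 1e6)

amount <- c(1.55, 2.5, 25, 3)
symbol <- c("B", "K", "m", "H")

# Indexing the named vector by symbol gives the multiplier for each record.
dollars  <- amount * unname(multiplier[symbol])
millions <- dollars / 1e6

print(dollars)

# as.integer() drops the fractional part, so 2.5K would be counted as 2K.
print(as.integer(amount))
```

The first element reproduces the code book example: 1.55 with symbol B corresponds to $1,550,000,000.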
Create a column called “damage” (total property and crop damage, in millions of dollars) and keep only the event type and damage columns for answering question 2.
damage = propval + cropval
data2 <- cbind(data2, as.data.frame(damage))
data2 = data2[, c("EVTYPE", "damage")]
head(data2)
## EVTYPE damage
## 187566 HURRICANE OPAL/HIGH WINDS 10.0
## 187571 THUNDERSTORM WINDS 5.5
## 187581 HURRICANE ERIN 26.0
## 187583 HURRICANE OPAL 52.0
## 187584 HURRICANE OPAL 30.0
## 187653 THUNDERSTORM WINDS 0.1
Group the data by EVTYPE and compute the sum of damage for each type of event.
damage_by_event <- group_by(data2, EVTYPE)
data2 <- summarize(damage_by_event, totaldamage = sum(damage))
I have pre-processed the raw data by combining and extracting the information needed to answer the questions. The data set data1 contains the event types and the total harm for each event type. The data set data2 contains the event types and the total economic damage for each event type. The following plot shows the total harm and total economic damage (on a log scale) against the index of each event type.
par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
plot(log10(data1$totalharm), type = "l", xlab = "Event Index", ylab = "Total Harm (log)")
plot(log10(data2$totaldamage), type = "l", xlab = "Event Index", ylab = "Total Economic Damage (log)")
mtext("Harms and Economic Damages vs. Event Indices", outer = TRUE, cex = 1.5)
Rank the harm and damage values and find the events related to the largest values.
data1sorted = arrange(data1, desc(totalharm))
data2sorted = arrange(data2, desc(totaldamage))
head(data1sorted[1], 10)
## Source: local data frame [10 x 1]
##
## EVTYPE
## 1 TORNADO
## 2 EXCESSIVE HEAT
## 3 TSTM WIND
## 4 FLOOD
## 5 LIGHTNING
## 6 HEAT
## 7 FLASH FLOOD
## 8 ICE STORM
## 9 THUNDERSTORM WIND
## 10 WINTER STORM
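Printing only the first column hides the totals behind the ranking. A small sketch on made-up totals (the `toy_totals` values are illustrative, not results from the storm data) shows how the same ordering can be produced in base R with `order()` while keeping the values next to the names:

```r
toy_totals <- data.frame(
  EVTYPE    = c("FLOOD", "HEAT", "TORNADO", "HAIL"),
  totalharm = c(3, 4, 21, 1),
  stringsAsFactors = FALSE
)

# Sort by total harm, largest first, then keep the top n rows.
n <- 3
ranked <- toy_totals[order(toy_totals$totalharm, decreasing = TRUE), ]
top_n_rows <- head(ranked, n)
print(top_n_rows)
```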
head(data2sorted[1], 10)
## Source: local data frame [10 x 1]
##
## EVTYPE
## 1 FLOOD
## 2 HURRICANE/TYPHOON
## 3 TORNADO
## 4 HURRICANE
## 5 RIVER FLOOD
## 6 HAIL
## 7 FLASH FLOOD
## 8 ICE STORM
## 9 STORM SURGE/TIDE
## 10 THUNDERSTORM WIND
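Both lists contain event types that look like variants of one another, for example TSTM WIND and THUNDERSTORM WIND, or HURRICANE and HURRICANE/TYPHOON. Their totals are counted separately above. As a hedged sketch of one possible cleanup (the `normalize_evtype()` helper and its substitution rules are assumptions, not part of this analysis), labels could be standardized before grouping:

```r
# Hypothetical helper: standardize case, whitespace, and a few known variants.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub(" +", " ", x)                                   # collapse repeated spaces
  x <- sub("^TSTM ", "THUNDERSTORM ", x)                    # expand the TSTM abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)  # use the singular form
  x
}

labels  <- c("TSTM WIND", " thunderstorm winds", "THUNDERSTORM  WIND", "FLOOD")
cleaned <- normalize_evtype(labels)
print(unique(cleaned))
```

Whether categories such as HURRICANE and HURRICANE/TYPHOON should also be merged is a judgment call that would need to be checked against the storm data documentation, so the sketch leaves them alone.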