This project explores the NOAA storm database in order to assess the effect of severe weather events on the economy and population of the U.S. Of particular interest are the following two questions:
We found that tornadoes are the most harmful events to population health, causing the most injuries and deaths. As far as economic health is concerned, floods caused the most property damage and droughts caused the most crop damage.
The data for this assignment is available here. It contains characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and crop and property damage. The events in the database start in the year 1950 and end in November 2011.
Load the data unless it already is in the workspace. We assume that the compressed file resides in the current directory.
library(readr)
if (!"stormdata" %in% ls())
stormdata <- read_csv("repdata_data_StormData.csv.bz2")
Examine the structure of the dataset.
str(stormdata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: chr NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Since we are most concerned with weather effects on population and economic health across the whole country as well as across time we ignore all time and location related variables and concentrate on the following variables:
EVTYPE - the event type, e.g. “Tornado”, etc.FATALITIES - the number of deaths caused by the eventINJURIES - the number of injuries caused by the eventPROPDMG - together with PROPDMGEXP specifies the amount of property damage caused by the eventPROPDMGEXP - the exponent with which to scale the value of PROPDMGCROPDMG - together with CROPDMGEXP specifies the amount of crop damage caused by the eventCROPDMGEXP - the exponent with which to scale the value of CROPDMGlibrary(dplyr)
stormdata <- select(stormdata, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Note: while the data in the earlier years is less complete than that of more recent years we choose to consider the full span of years nonetheless.
There are many event types, 977 to be precise:
length(unique(stormdata$EVTYPE))
## [1] 977
Many of them really refer to the same type, e.g. TORNADO and TORNADOS:
unique(stormdata[grep("^.*TORNADO.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE)
## [1] "TORNADO" "TORNADO F0"
## [3] "TORNADOS" "WATERSPOUT/TORNADO"
## [5] "WATERSPOUT TORNADO" "WATERSPOUT-TORNADO"
## [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"
## [9] "WATERSPOUT/ TORNADO" "TORNADO F3"
## [11] "TORNADO F1" "TORNADO/WATERSPOUT"
## [13] "TORNADO F2" "TORNADOES"
## [15] "TORNADO DEBRIS"
We re-label all the tornado-like events as TORNADO and similarly for all the hurricane-like and flood-like events.
stormdata[grep("^.*TORNADO.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE = "TORNADO"
stormdata[grep("^.*HURRICANE.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE = "HURRICANE"
stormdata[grep("^.*FLOOD.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE = "FLOOD"
A thorough analysis would take into consideration all event types but this is hardly the goal of this particular exercise.
We compute the total numbers of injuries and fatalities by event type and sort the rows in descending order.
totInjuries.by.eventtype <- group_by(stormdata, EVTYPE) %>%
summarise(totInjuries = sum(INJURIES)) %>%
arrange(desc(totInjuries))
totFatalities.by.eventtype <- group_by(stormdata, EVTYPE) %>%
summarise(totFatalities = sum(FATALITIES)) %>%
arrange(desc(totFatalities))
totInjuries.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
##
## EVTYPE totInjuries
## (chr) (dbl)
## 1 TORNADO 91407
## 2 FLOOD 8604
## 3 TSTM WIND 6957
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 THUNDERSTORM WIND 1488
## 9 HAIL 1361
## 10 HURRICANE 1328
totFatalities.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
##
## EVTYPE totFatalities
## (chr) (dbl)
## 1 TORNADO 5661
## 2 EXCESSIVE HEAT 1903
## 3 FLOOD 1525
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 RIP CURRENT 368
## 8 HIGH WIND 248
## 9 AVALANCHE 224
## 10 WINTER STORM 206
We want to display bar graphs of the number of injuries and fatalities for the most severe weather events, say the “top” 10 event types. Thus we need to reorder the levels of the factors totInjuries.by.eventtype$EVTYPE and totFatalities.by.eventtype$EVTYPE in decreasing order of totInjuries and totFatalities. However, specifying decreasing = TRUE as the argument to sort has no effect so we simply reverse the order of the levels.
totInjuries.by.eventtype$EVTYPE <- reorder(totInjuries.by.eventtype$EVTYPE,
totInjuries.by.eventtype$totInjuries,
FUN = sort)
totInjuries.by.eventtype$EVTYPE =
factor(totInjuries.by.eventtype$EVTYPE,
levels = rev(levels(totInjuries.by.eventtype$EVTYPE)))
totFatalities.by.eventtype$EVTYPE <- reorder(totFatalities.by.eventtype$EVTYPE,
totFatalities.by.eventtype$totFatalities,
FUN = sort)
totFatalities.by.eventtype$EVTYPE =
factor(totFatalities.by.eventtype$EVTYPE,
levels = rev(levels(totFatalities.by.eventtype$EVTYPE)))
We now plot the worst 10 weather events for the number of injuries and fatalities.
library(ggplot2)
totInjuries.plot <- ggplot(totInjuries.by.eventtype[1:10,], aes(x=EVTYPE, y=totInjuries)) +
geom_bar(stat = "identity", fill="lightblue", colour="black") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Event Type") +
ylab("Total number of injuries") +
ggtitle("Top 10 weather events in terms of injuries")
totFatalities.plot <- ggplot(totFatalities.by.eventtype[1:10,], aes(x=EVTYPE, y=totFatalities)) +
geom_bar(stat = "identity", fill="lightblue", colour="black") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.margin = unit(c(0.2,0,1,0.1), "cm")) +
xlab("Event Type") +
ylab("Total number of fatalities") +
ggtitle("Top 10 weather events in terms of fatalities")
library(grid)
library(gridExtra)
grid.arrange(totInjuries.plot, totFatalities.plot, ncol=2)
We compute the damage (in dollars) caused by the various event types to property and crops. We use the variables PROPDMG and PROPDMGEXP for property damage and CROPDMG and CROPDMGEXP for crop damage. The variables PROPDMG and PROPDMGEXP encode the actual amount as \(x * 10^y\) where \(x\) is the value stored in PROPDMG and \(y\) the value stored in PROPDMGEXP. Similarly for CROPDMG and CROPDMGEXP. However, the variables PROPDMGEXP and CROPDMGEXP hold many NA values together with several invalid quantities. Furthermore, valid values can be either a number or a letter code such as “K” for \(10^3\) or “M” for \(10^6\).
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" NA "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(stormdata$CROPDMGEXP)
## [1] NA "M" "K" "m" "B" "?" "0" "k" "2"
We replace invalid values with 0 and the letter codes with the corresponding exponent.
stormdata$PROPDMGEXP[which(is.na(stormdata$PROPDMGEXP))] <- 0
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP %in% c("+","?","-"))] <- 0
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP %in% c("H","h"))] <- 2
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP == "K")] <- 3
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP %in% c("M","m"))] <- 6
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP == "B")] <- 9
stormdata$CROPDMGEXP[which(is.na(stormdata$CROPDMGEXP))] <- 0
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP == "?")] <- 0
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP %in% c("K","k"))] <- 3
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP %in% c("M","m"))] <- 6
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP == "B")] <- 9
We now create two new variables PropDmgAmount and CropDmgAmount to hold the actual damage amounts.
stormdata <- mutate(stormdata,
PropDmgAmount = PROPDMG * 10^as.numeric(PROPDMGEXP),
CropDmgAmount = CROPDMG * 10^as.numeric(CROPDMGEXP))
We tally the property and crop damage amounts by event type. This is entirely similar to what was done above for the number of injuries and fatalities.
totPropDmg.by.eventtype <- group_by(stormdata, EVTYPE) %>%
summarise(totPropDmg = sum(PropDmgAmount)) %>%
arrange(desc(totPropDmg))
totPropDmg.by.eventtype$EVTYPE <- reorder(totPropDmg.by.eventtype$EVTYPE,
totPropDmg.by.eventtype$totPropDmg,
FUN = sort)
totPropDmg.by.eventtype$EVTYPE =
factor(totPropDmg.by.eventtype$EVTYPE, levels = rev(levels(totPropDmg.by.eventtype$EVTYPE)))
totCropDmg.by.eventtype <- group_by(stormdata, EVTYPE) %>%
summarise(totCropDmg = sum(CropDmgAmount)) %>%
arrange(desc(totCropDmg))
totCropDmg.by.eventtype$EVTYPE <- reorder(totCropDmg.by.eventtype$EVTYPE,
totCropDmg.by.eventtype$totCropDmg,
FUN = sort)
totCropDmg.by.eventtype$EVTYPE =
factor(totCropDmg.by.eventtype$EVTYPE, levels = rev(levels(totCropDmg.by.eventtype$EVTYPE)))
Here are the top 10 worst weather events for property damage:
totPropDmg.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
##
## EVTYPE totPropDmg
## (fctr) (dbl)
## 1 FLOOD 168212215835
## 2 HURRICANE 84756180010
## 3 TORNADO 58603317927
## 4 STORM SURGE 43323536000
## 5 HAIL 15735267513
## 6 TROPICAL STORM 7703890550
## 7 WINTER STORM 6688497251
## 8 HIGH WIND 5270046295
## 9 WILDFIRE 4765114000
## 10 STORM SURGE/TIDE 4641188000
and the top 10 worst for crop damage:
totCropDmg.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
##
## EVTYPE totCropDmg
## (fctr) (dbl)
## 1 DROUGHT 13972566000
## 2 FLOOD 12380109100
## 3 HURRICANE 5515292800
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 EXTREME COLD 1292973000
## 7 FROST/FREEZE 1094086000
## 8 HEAVY RAIN 733399800
## 9 TROPICAL STORM 678346000
## 10 HIGH WIND 638571300
Finally, we display bar graphs of the 10 worst events:
totPropDmg.plot <- ggplot(totPropDmg.by.eventtype[1:10,], aes(x=EVTYPE, y=totPropDmg)) +
geom_bar(stat = "identity", fill="lightblue", colour="black") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Event Type") +
ylab("Total amount of property damage ($)") +
ggtitle("Top 10 weather events\n in terms of property damage")
totCropDmg.plot <- ggplot(totCropDmg.by.eventtype[1:10,], aes(x=EVTYPE, y=totCropDmg)) +
geom_bar(stat = "identity", fill="lightblue", colour="black") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.margin = unit(c(0.2,0,0.5,0.2), "cm")) +
xlab("Event Type") +
ylab("Total amount of crop damage ($)") +
ggtitle("Top 10 weather events\n in terms of crop damage")
grid.arrange(totPropDmg.plot, totCropDmg.plot, ncol=2)
We see that tornadoes caused the most injuries and deaths while flood caused the most property damage and drought caused the most crop damage.