In this analysis, we sought to answer the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
For population health, tornadoes had the greatest impact on fatality, injury, and composite casualty totals. As for economic consequences, flooding appeared to have the greatest impact.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
To start, data from the NOAA database are read in from their online repository, and we can inspect the general organization of these data.
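# load packages used below: dplyr (%>%, select, group_by, summarise, arrange), ggplot2, gridExtra (grid.arrange)
library(dplyr)
library(ggplot2)
library(gridExtra)
# reading in data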
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./FStormData.csv.bz2", method = "curl")
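# stringsAsFactors = TRUE keeps text columns as factors, matching the str() output shown here
stormdata <- read.csv("FStormData.csv.bz2", stringsAsFactors = TRUE)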
# show first few lines of dataset, and show variable characteristics
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
dim(stormdata)
## [1] 902297 37
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
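The earlier remark that the first years of the database contain fewer records can be checked by counting events per year. The following is a minimal sketch, assuming the `BGN_DATE` strings follow the month/day/year layout seen in the `head()` output. The six sample values are copied from that output, and `bgn_date`, `event_year`, and `events_per_year` are illustrative names rather than objects from the analysis.

```r
# Sample BGN_DATE values copied from the head() output above.
bgn_date <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "6/8/1951 0:00:00", "11/15/1951 0:00:00", "11/15/1951 0:00:00")

# Parse the date part (trailing time text is ignored) and extract the year.
event_year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Count how many events were recorded in each year.
events_per_year <- table(event_year)
events_per_year
```

On the full data, `stormdata$BGN_DATE` would take the place of the sample vector, and the resulting counts could be plotted against year to see how coverage changes over time.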
This is a relatively large data set (902,297 rows and 37 variables), and most of the variables are not pertinent to the objectives of this analysis, so we keep only the relevant columns for simplicity and speed of processing.
# selecting high-yield columns
stormdata <- stormdata %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS)
# convert PROPDMGEXP to actual numbers
value <- function(x) {
x <- tolower(x)
if (x == "k") res <- 1000
if (x == "m") res <- 1e+06
if (x == "b") res <- 1e+09
else res <- 1
res
}
stormdata$PROP_DMG <- stormdata$PROPDMG * sapply(stormdata$PROPDMGEXP, value) /1000000
stormdata$CROP_DMG <- stormdata$CROPDMG * sapply(stormdata$CROPDMGEXP, value) /1000000
stormdata$TOTAL_DMG <- stormdata$PROP_DMG + stormdata$CROP_DMG
# casualties = composite measure of fatalities + injuries
stormdata$CASUALTIES <- stormdata$FATALITIES + stormdata$INJURIES
# again, show first few lines of dataset, and show variable characteristics
head(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## REMARKS PROP_DMG CROP_DMG TOTAL_DMG CASUALTIES
## 1 0.0000250 0 0.0000250 15
## 2 0.0000025 0 0.0000025 0
## 3 0.0000250 0 0.0000250 2
## 4 0.0000025 0 0.0000025 2
## 5 0.0000025 0 0.0000025 2
## 6 0.0000025 0 0.0000025 6
dim(stormdata)
## [1] 902297 12
str(stormdata)
## 'data.frame': 902297 obs. of 12 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PROP_DMG : num 0.000025 0.0000025 0.000025 0.0000025 0.0000025 0.0000025 0.0000025 0.0000025 0.000025 0.000025 ...
## $ CROP_DMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TOTAL_DMG : num 0.000025 0.0000025 0.000025 0.0000025 0.0000025 0.0000025 0.0000025 0.0000025 0.000025 0.000025 ...
## $ CASUALTIES: num 15 0 2 2 2 6 1 0 15 0 ...
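One detail of `value()` is worth checking against the output above. As written, the `else` pairs only with the final `if`, so inputs of "k" and "m" end up with a multiplier of 1, which is consistent with the 25 K entry appearing as 0.0000250 rather than 0.025. A hedged sketch of a vectorised alternative follows. `exp_multiplier` and `value_vec` are hypothetical names, and mapping every other code (blank, "?", digits, and so on) to 1 is an assumption carried over from the original function, not documented NOAA behaviour.

```r
# Hypothetical lookup table: exponent code -> dollar multiplier.
exp_multiplier <- c(k = 1e3, m = 1e6, b = 1e9)

# Vectorised sketch: look up each code, falling back to 1 when it is not k, m, or b.
value_vec <- function(x) {
  mult <- exp_multiplier[tolower(as.character(x))]
  mult[is.na(mult)] <- 1  # assumption: unrecognised or blank codes mean "no scaling"
  unname(mult)
}

value_vec(c("K", "m", "B", "", "?"))
```

Because the lookup is vectorised, it could be applied as `stormdata$PROPDMG * value_vec(stormdata$PROPDMGEXP)` without `sapply()`. The tables in this report were produced with the original `value()`.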
First, we will look at the number of fatalities with each event type.
# calculate total number of fatalities
totalfatalities <- stormdata %>%
  summarise(total = sum(FATALITIES))
# calculate total fatalities for each event type (over entire dataset), and display the top event types
fatalitydata <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(events = n(), total = sum(FATALITIES)) %>%
  arrange(desc(total))
topfatality <- fatalitydata[1:5,]
topfatality
## # A tibble: 5 x 3
## EVTYPE events total
## <fct> <int> <dbl>
## 1 TORNADO 60652 5633
## 2 EXCESSIVE HEAT 1678 1903
## 3 FLASH FLOOD 54277 978
## 4 HEAT 767 937
## 5 LIGHTNING 15754 816
The table ranks event types from most to least deadly. Tornadoes resulted in the greatest number of fatalities (5,633), followed by excessive heat (1,903).
Next, we look at the total number of non-fatal injuries for each weather event type and show the event types that led to the greatest number of injuries.
# calculate total number of injuries
totalinjuries <- stormdata %>%
  summarise(total = sum(INJURIES))
# calculate total injuries for each event type (over entire dataset), and display the top event types
injurydata <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(n = n(), total = sum(INJURIES)) %>%
  arrange(desc(total))
topinjury <- injurydata[1:5,]
topinjury
## # A tibble: 5 x 3
## EVTYPE n total
## <fct> <int> <dbl>
## 1 TORNADO 60652 91346
## 2 TSTM WIND 219940 6957
## 3 FLOOD 25326 6789
## 4 EXCESSIVE HEAT 1678 6525
## 5 LIGHTNING 15754 5230
Again, tornadoes are at the top of the list, and are responsible for the greatest number of injuries.
Finally, we will demonstrate which events account for the most casualties, defined as the sum of fatalities and injuries.
# calculate total casualties for each event type (over entire dataset), and display the top event types
casualtydata <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(n = n(), total = sum(CASUALTIES)) %>%
  arrange(desc(total))
topcasualty <- casualtydata[1:5,]
topcasualty
## # A tibble: 5 x 3
## EVTYPE n total
## <fct> <int> <dbl>
## 1 TORNADO 60652 96979
## 2 EXCESSIVE HEAT 1678 8428
## 3 TSTM WIND 219940 7461
## 4 FLOOD 25326 7259
## 5 LIGHTNING 15754 6046
Unsurprisingly, tornadoes account for the greatest number of casualties, with excessive heat as a runner-up. The data for fatalities, injuries, and casualties are displayed graphically below (Fig 1).
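Since casualties are defined as fatalities plus injuries, the three tables can be cross-checked against one another. The sketch below does this for the event types that appear in all three top-five lists, using totals copied from the tables above; the vector names are illustrative.

```r
# Totals copied from the fatality, injury, and casualty tables above.
fatalities <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "LIGHTNING" = 816)
injuries   <- c("TORNADO" = 91346, "EXCESSIVE HEAT" = 6525, "LIGHTNING" = 5230)
casualties <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "LIGHTNING" = 6046)

# Each casualty total should equal fatalities + injuries for the same event type.
check <- fatalities + injuries[names(fatalities)] == casualties[names(fatalities)]
check
```

All three comparisons hold, so the composite measure agrees with its two components for these event types.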
# graph of fatality/injury/casualty data
fig1a <- ggplot(topfatality, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  xlab("Top 5 events") +
  ylab("Fatalities") +
  ggtitle("Figure 1a. Severe weather event fatalities in USA from 1950-2011") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5), legend.position = "none")
fig1b <- ggplot(topinjury, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  xlab("Top 5 events") +
  ylab("Injuries") +
  ggtitle("Figure 1b. Severe weather event injuries in USA from 1950-2011") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5), legend.position = "none")
fig1c <- ggplot(topcasualty, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  xlab("Top 5 events") +
  ylab("Casualties") +
  ggtitle("Figure 1c. Severe weather event casualties in USA from 1950-2011") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5), legend.position = "none")
grid.arrange(fig1a, fig1b, fig1c, nrow = 3)
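ggplot2 places a discrete x axis in factor-level order, so with `EVTYPE` stored as a factor the bars are not necessarily sorted by height. The sketch below shows one way to order them, using base R's `reorder()` on the fatality totals copied from the table above; `top` and `ev_sorted` are illustrative names.

```r
# Fatality totals for the top five event types, copied from the table above.
top <- data.frame(
  EVTYPE = factor(c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING")),
  total  = c(5633, 1903, 978, 937, 816)
)

# Reorder the factor levels by descending total (negating sorts largest first).
ev_sorted <- reorder(top$EVTYPE, -top$total)
levels(ev_sorted)
```

Writing `aes(x = reorder(EVTYPE, -total), ...)` in the plotting calls would apply the same ordering to the bars.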
### Across the United States, which types of events have the greatest economic consequences?

We will similarly examine the impact of severe weather events on damage to property, crops, and both combined (total damage).
# calculate total property damage for each event type (over entire dataset), and display the top event types
propertydata <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(events = n(), total = sum(PROP_DMG)) %>%
  arrange(desc(total))
topproperty <- propertydata[1:5,]
topproperty
## # A tibble: 5 x 3
## EVTYPE events total
## <fct> <int> <dbl>
## 1 FLOOD 25326 122501.
## 2 HURRICANE/TYPHOON 88 65500.
## 3 STORM SURGE 261 42560.
## 4 HURRICANE 174 5700.
## 5 TORNADO 60652 5303.
Flooding appears to have caused the greatest property damage. Because PROP_DMG was divided by 1,000,000, the totals are expressed in millions of dollars.
# calculate total crop damage for each event type (over entire dataset), and display the top event types
cropdata <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(events = n(), total = sum(CROP_DMG)) %>%
  arrange(desc(total))
topcrop <- cropdata[1:5,]
topcrop
## # A tibble: 5 x 3
## EVTYPE events total
## <fct> <int> <dbl>
## 1 RIVER FLOOD 173 5000.
## 2 ICE STORM 2006 5000.
## 3 HURRICANE/TYPHOON 88 1510.
## 4 DROUGHT 2488 1500.
## 5 HEAT 767 400.
River flooding and ice storms rank highest for crop damage, at roughly 5,000 each in the table above, followed by hurricane/typhoon events.
# calculate total (property + crop) damage for each event type (over entire dataset), and display the top event types
totaldata <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(events = n(), total = sum(TOTAL_DMG)) %>%
  arrange(desc(total))
toptotal <- totaldata[1:5,]
toptotal
## # A tibble: 5 x 3
## EVTYPE events total
## <fct> <int> <dbl>
## 1 FLOOD 25326 122501.
## 2 HURRICANE/TYPHOON 88 67010.
## 3 STORM SURGE 261 42560.
## 4 RIVER FLOOD 173 10000.
## 5 HURRICANE 174 5700.
Flooding appears to have caused the greatest economic damage, overall.
# graph of total damage data; TOTAL_DMG was divided by 1e6, so the axis is in millions of dollars
ggplot(toptotal, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  xlab("Top 5 events") +
  ylab("Property and crop damage (millions of $)") +
  ggtitle("Figure 2. Severe weather event economic impact in USA from 1950-2011") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5), legend.position = "none")
In summary, tornadoes had the greatest impact on population health (i.e., event fatalities, injuries, and composite casualties). As for economic consequences, flooding appeared to have the greatest impact.
There are some important limitations to this research. Most notably, the event types overlap substantially: for example, HEAT and EXCESSIVE HEAT, or FLOOD, FLASH FLOOD, and RIVER FLOOD, are recorded as separate categories. Totals for each outcome are therefore split across related categories, which may have affected the rankings. In a future analysis, it would likely be useful to group similar events together for a more complete understanding of their impact on the outcomes in question.
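The regrouping suggested above could start from a simple pattern-based mapping. The following is only a sketch: `recategorise()` is a hypothetical helper, the patterns and category labels are illustrative choices rather than an official NOAA scheme, and the input totals are the top-five fatality figures copied from the fatalities table earlier in this report.

```r
# Hypothetical coarse categories; patterns and labels are illustrative only.
recategorise <- function(evtype) {
  ev <- toupper(trimws(as.character(evtype)))
  out <- ev  # event types matching no pattern keep their original label
  out[grepl("HEAT", ev)] <- "HEAT"
  out[grepl("FLOOD", ev)] <- "FLOOD"
  out[grepl("HURRICANE|TYPHOON", ev)] <- "HURRICANE"
  out[grepl("TSTM|THUNDERSTORM", ev)] <- "THUNDERSTORM WIND"
  out
}

# Fatality totals for the top five event types, copied from the fatalities table.
top <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),
  total  = c(5633, 1903, 978, 937, 816)
)

# Sum the totals within each merged category and rank them.
merged <- tapply(top$total, recategorise(top$EVTYPE), sum)
sort(merged, decreasing = TRUE)
```

In this small example, HEAT and EXCESSIVE HEAT combine to 2,840 fatalities. Applying such a mapping to the full `EVTYPE` column before grouping would require reviewing all 985 levels, since simple patterns can misclassify unusual labels.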