In this report we aim to answer some basic questions about severe weather events. Specifically, we try to identify which types of events are the most harmful to population health and the most deleterious to the economy. To answer these questions, we obtained the storm database from the U.S. National Oceanic and Atmospheric Administration’s (NOAA). This database tracks characteristics of major storms and weather events in the United States, including estimates of any fatalities, injuries, and property and crop damage. From these data, we found that tornadoes and heat are the severe weather event types by far most dangerous to people, while flooding, hurricanes, and storm surges are the most costly event types to the economy. Interestingly, only flooding is one of the top three most dangerous or most costly event types.
library(R.utils)
library(data.table)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(quantmod)
options("getSymbols.warning4.0"=FALSE)
From the Coursera “Reproducible Research” course file repository, we obtain the storm data in bzip archive format and extract it.
bzFilename <- "repdata-data-StormData.csv.bz2"
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = bzFilename)
bunzip2("repdata-data-StormData.csv.bz2", overwrite=T, remove=F)
We read in the extracted file, which is in CSV format, into a data frame, which we then convert to a data.table for efficiency of later operations.
storm = read.csv("E:/Aulas/Data Science/Módulo 5 - Reproducible Research/Semana 3/Trabalho/repdata-data-StormData.csv")
storm = data.table(storm)
We check the first few rows in the dataset, which should have 902297 rows in total.
dim(storm)
## [1] 902297 37
head(storm, n=3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1: TORNADO 0 0
## 2: TORNADO 0 0
## 3: TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1: NA 0 14.0 100 3 0 0
## 2: NA 0 2.0 150 2 0 0
## 3: NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1: 15 25.0 K 0
## 2: 0 2.5 K 0
## 3: 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1: 3040 8812 3051 8806 1
## 2: 3042 8755 0 0 2
## 3: 3340 8742 0 0 3
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
One concern is that there may be an inconsistence balance of event types recorded over the years. For example, maybe in the earlier years, deaths due to rip currents were not recorded. To allow for more fair comparisons, we want to restrict our analysis to years that demonstrate a large number of recorded events, as this may indicate better record-keeping.
So we count the number of events per year.
We convert the property damage and crop damage data into comparable numerical form, based on the meaning of units described in the code book (National Climatic Data Center’s record layout document, which is referenced on the Investigative Reporers & Editors web site.)
storm$PROPDMGEXP = as.character(storm$PROPDMGEXP)
storm$PROPDMGEXP[toupper(storm$PROPDMGEXP) == 'B'] = "9"
storm$PROPDMGEXP[toupper(storm$PROPDMGEXP) == 'M'] = "6"
storm$PROPDMGEXP[toupper(storm$PROPDMGEXP) == 'K'] = "3"
storm$PROPDMGEXP[toupper(storm$PROPDMGEXP) == 'H'] = "2"
storm$PROPDMGEXP = as.numeric(storm$PROPDMGEXP)
storm$PROPDMGEXP[is.na(storm$PROPDMGEXP)] = 0
storm$PropertyDamage = storm$PROPDMG * 10^storm$PROPDMGEXP
summary(storm$PropertyDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 4.746e+05 5.000e+02 1.150e+11
storm$CROPDMGEXP = as.character(storm$CROPDMGEXP)
storm$CROPDMGEXP[toupper(storm$CROPDMGEXP) == 'B'] = "9"
storm$CROPDMGEXP[toupper(storm$CROPDMGEXP) == 'M'] = "6"
storm$CROPDMGEXP[toupper(storm$CROPDMGEXP) == 'K'] = "3"
storm$CROPDMGEXP[toupper(storm$CROPDMGEXP) == 'H'] = "2"
storm$CROPDMGEXP[toupper(storm$CROPDMGEXP) == ''] = "0"
storm$CROPDMGEXP = as.numeric(storm$CROPDMGEXP)
storm$CROPDMGEXP[is.na(storm$CROPDMGEXP)] = 0
storm$CropDamage = storm$CROPDMG * 10^storm$CROPDMGEXP
summary(storm$CropDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 5.442e+04 0.000e+00 5.000e+09
We consider both property and crop damage as important components of the economic impact of weather events. So we sum them up, even though property damage figures dominates those for crop damage.
storm$TotalDamage = storm$PropertyDamage + storm$CropDamage
summary(storm$TotalDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.29e+05 1.00e+03 1.15e+11
We adjust for inflation using the Consumer Price Index.
invisible(getSymbols("CPIAUCSL", src='FRED'))
yearlyCPI = apply.yearly(CPIAUCSL, mean)
yearlyCPI.in2011Dollars = as.numeric(yearlyCPI["2011"])/yearlyCPI
cpiAdj = as.data.frame(yearlyCPI.in2011Dollars)
cpiAdj$year = as.numeric(format(as.Date(rownames(cpiAdj)), "%Y"))
colnames(cpiAdj)[1] = "CPIConv"
head(cpiAdj, n = 3)
## CPIConv year
## 1947-12-01 10.072229 1947
## 1948-12-01 9.354530 1948
## 1949-12-01 9.447188 1949
## storm = merge(storm, cpiAdj, by='year')
## storm$TotalDamage = storm$TotalDamage * storm$CPIConv
Since the event types in the data are inconsistently named, we manually merge a few event types. Since there are close to 1000 different levels in the event type factor and it would be too time-consuming to process them all, we focus on about 80 of the most significant event types. This focus should not affect our analysis as we are seeking to identify only the most impactful event types. Event types are considered significant if they have high total damage.
For example, in the list below, we should combine “HURRICANE OPAL”, “HURRICANE” and “HURRICANE/TYPHOON”.
tail(sort(tapply(storm$TotalDamage, storm$EVTYPE, sum)), n=20)
## HEAVY RAIN/SEVERE WEATHER WILD/FOREST FIRE
## 2500000000 3108626330
## HURRICANE OPAL THUNDERSTORM WIND
## 3191846000 3897965522
## STORM SURGE/TIDE TSTM WIND
## 4642038000 5038935845
## WILDFIRE HIGH WIND
## 5060586800 5908617595
## WINTER STORM TROPICAL STORM
## 6715441251 8382236550
## ICE STORM RIVER FLOOD
## 8967041360 10148404500
## HURRICANE DROUGHT
## 14610229010 15018672000
## FLASH FLOOD HAIL
## 18243991079 18761221986
## STORM SURGE TORNADO
## 43323541000 57362333947
## HURRICANE/TYPHOON FLOOD
## 71913712800 150319678257
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("HURRICANE/TYPHOON",
"HURRICANE OPAL",
"HURRICANE OPAL/HIGH WINDS",
"HURRICANE EMILY",
"TYPHOON",
"HURRICANE ERIN")] = "HURRICANE"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("TSTM WIND",
" TSTM WIND",
"SEVERE THUNDERSTORM WINDS",
"THUNDERSTORM WINDS")] = "THUNDERSTORM WIND"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("HEAVY RAIN/SEVERE WEATHER",
"EXCESSIVE RAINFALL",
"UNSEASONAL RAIN",
"HEAVY RAINS")] = "HEAVY RAIN"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("STORM SURGE/TIDE"
)] = "STORM SURGE"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("WILD/FOREST FIRE",
"WILDFIRES",
"WILD FIRES")] = "WILDFIRE"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("EXCESSIVE HEAT",
"HEAT WAVE",
"EXTREME HEAT",
"UNSEASONABLY WARM",
"RECORD/EXCESSIVE HEAT",
"RECORD HEAT")] = "HEAT"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("EXTREME COLD",
"FROST/FREEZE",
"FROST",
"Early Frost ",
"DAMAGING FREEZE",
"RECORD COLD",
"COLD/WIND CHILL",
"EXTREME COLD/WIND CHILL",
"UNSEASONABLY COLD",
"Unseasonable Cold",
"HARD FREEZE",
"FREEZE")] = "COLD"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("HIGH WINDS",
"HIGH WIND",
"BLOWING WIND",
"STRONG WINDS",
"STRONG WIND")] = "WIND"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("FLASH FLOODING",
"FLASH FLOOD/FLOOD",
"FLOOD/FLASH FLOOD")] = "FLASH FLOOD"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("SMALL HAIL")] = "HAIL"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("RIVER FLOODING"
)] = "RIVER FLOOD"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("FLOODING",
"MAJOR FLOOD")] = "FLOOD"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("COASTAL FLOODING",
"COASTAL FLOODING/EROSION",
"COASTAL FLOODING/EROSION",
"Erosion/Cstl Flood",
"COASTAL FLOOD")] = "COASTAL FLOOD"
We also look at about 80 of the most significant event types with respect to injuries and fatalities and again merge the terms that represent the same type of event.
storm$TotalPeople = storm$INJURIES + storm$FATALITIES
tail(sort(tapply(storm$TotalPeople, storm$EVTYPE, sum)), n=20)
## WINTER WEATHER DUST STORM RIP CURRENTS RIP CURRENT
## 431 462 501 600
## COLD FOG BLIZZARD HEAVY SNOW
## 740 796 906 1148
## HAIL HURRICANE WINTER STORM WILDFIRE
## 1386 1463 1527 1696
## ICE STORM WIND FLASH FLOOD LIGHTNING
## 2064 2243 2828 6046
## FLOOD THUNDERSTORM WIND HEAT TORNADO
## 7267 10054 12364 96979
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("TROPICAL STORM GORDON",
"TROPICAL STORM JERRY")] = "TROPICAL STORM"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("DENSE FOG"
)] = "FOG"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("RIP CURRENTS"
)] = "RIP CURRENT"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("HEAVY SURF",
"HEAVY SURF/HIGH SURF")] = "HIGH SURF"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("WATERSPOUT/TORNADO"
)] = "WATERSPOUT"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("WINTRY MIX",
"WINTER WEATHER MIX",
"WINTER WEATHER/MIX")] = "WINTER WEATHER"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("WINTER STORMS"
)] = "WINTER STORM"
storm$EVTYPE[toupper(storm$EVTYPE) %in%
c("MARINE TSTM WIND"
)] = "MARINE THUNDERSTORM WIND"
When we analyze the data, we look at total numbers in aggregate. We don’t look at yearly averages because there are many significant events that happen only once every few years, for example massive hurricanes and floods.
We look at the number of people killed or hurt per event type, for the 50 most dangerous event types:
dangerous = as.data.frame.table(tail(sort(tapply(storm$TotalPeople, storm$EVTYPE, sum)), n=50))
colnames(dangerous) = c("EventType", "TotalPeople")
p1 = ggplot(data=dangerous, aes(x=EventType, y=TotalPeople)) +
theme(plot.margin=unit(c(1,1,-0.2,.91), "cm")) +
geom_bar(stat="identity") +
labs(x="", y="# People Killed/Injured")
p2 = p1 + scale_y_log10() +
theme(plot.margin=unit(c(-0.2,1,1,1), "cm")) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
labs(y=expression(paste("(", log[10], ")"))) +
xlab("Type of Severe Weather Event")
p1 = p1 + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
grid.arrange(p1, p2, nrow=2, main="Most Dangerous Event Types")
We can see from the skew of the graph that certain event types are the most dangerous to the health of the population, most notably tornadoes and heat. In fact, we have to also plot in logarithmic scale in order to be able to compare the less dangerous event types.
We now look at total property and crop damage per event type, for the 50 most costly event types.
ruinous = as.data.frame.table(tail(sort(tapply(storm$TotalDamage, storm$EVTYPE, sum)), n=50))
colnames(ruinous) = c("EventType", "TotalDamage")
p1 = ggplot(data=ruinous, aes(x=EventType, y=TotalDamage)) +
theme(plot.margin=unit(c(1,1,-0.2,.82), "cm")) +
geom_bar(stat="identity") +
labs(x="", y="Property/Crop Damage (USD)")
p2 = p1 + scale_y_log10() +
theme(plot.margin=unit(c(-0.2,1,1,1), "cm")) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
labs(y=expression(paste("(", log[10], "USD)"))) +
xlab("Type of Severe Weather Event")
p1 = p1 + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
grid.arrange(p1, p2, nrow=2, main="Most Costly Event Types")
As with dangerous event types, we can see from the skew of the graph that certain event types are disproportionately costly to the economy, most notably flood, hurricane, and storm surge. Once again, we have to also plot in logarithmic scale in order to be able to compare the less costly event types.
We also want to look at which event types are harmful both for people and property.
We narrow down to the top 10 in each dimension.
dangerous10 = tail(sort(tapply(storm$TotalPeople, storm$EVTYPE, sum)), n=10)
ruinous10 = tail(sort(tapply(storm$TotalDamage, storm$EVTYPE, sum)), n=10)
impactfulEvTypes = unique(c(names(ruinous10), names(dangerous10)))
We can now see which event types are particularly dangerous or particularly costy or both.
# Using dplyr
impact = storm %.%
group_by(EVTYPE) %.%
summarize(TotalDamage = sum(TotalDamage), TotalPeople = sum(TotalPeople))
## Warning: %.% is deprecated. Please use %>%
## Warning: %.% is deprecated. Please use %>%
impact[!(impact$EVTYPE %in% impactfulEvTypes), "EVTYPE"] = ''
ggplot(impact) +
aes(y=TotalPeople, x=TotalDamage, label=EVTYPE) +
geom_point() +
ggtitle("Dangerous & Costly Event Types") +
labs(y=expression(paste("# People Killed/Injured")),
x=expression(paste("Property/Crop damage (USD)"))) +
geom_text(angle=45, vjust=0, hjust=0, size=3) +
ylim(0, 25000)
## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_text).
From this plot, we can see that the most impactful event types are either very dangerous to the population or very costly, but generally not at the same time. For example, heat injures and kills many people but is not particularly costly. Similarly, storm surge is costly but is relatively harmless to people.
Tornadoes are the most dangerous to people but storm surges, hurricanes, and floods are most costly. Also, flooding is the most costly, but heat and tornadoes injure or kill more people.