Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Data were downloaded according to the job specifications. The original base had to undergo some changes in the EVTYPE variable as there were subclassifications (11379 lines in total) and therefore were tuned to the top rank, resulting in 742 unique groups. Under these conditions the major fatalities and injuries are highlighted for TORNADO, TSTM WIND, LIGHTNING, FLOOD and EXCESSIVE HEAT. To assess the damage, the units were adjusted to the same order of magnitude, concluding that FLOOD, HURRICANE, TORNADO, STORM SURGE and HAIL are the ones that caused the greatest financial impact on the United States at all times. All details as in the tables, graphs and comments that follow in this document.
Let’s load and evaluate the structure of the database and see what we have about the variable types of events (EVTYPE).
# Get data
suppressWarnings({db = data.table::fread(loc)})
##
Read 4.1% of 967216 rows
Read 23.8% of 967216 rows
Read 38.3% of 967216 rows
Read 51.7% of 967216 rows
Read 61.0% of 967216 rows
Read 74.4% of 967216 rows
Read 80.6% of 967216 rows
Read 87.9% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:10
set.seed(1234)
sample(db$EVTYPE, 20)
## [1] "HAIL" "FLASH FLOOD"
## [3] "TSTM WIND" "MARINE TSTM WIND"
## [5] "THUNDERSTORM WIND" "EXTREME COLD/WIND CHILL"
## [7] "HAIL" "HEAVY SNOW"
## [9] "TSTM WIND" "HAIL"
## [11] "HAIL" "WINTER STORM"
## [13] "DROUGHT" "FLOOD"
## [15] "HEAVY SNOW" "HAIL"
## [17] "HAIL" "HAIL"
## [19] "TSTM WIND" "THUNDERSTORM WINDS"
Apparently we have events with subclassifications, so let’s evaluate how many are in this condition.
paste0(sum(grepl("/", db$EVTYPE)), " events with subclassification.")
## [1] "11379 events with subclassification."
paste0("That represents ", round(100*(sum(grepl("/", db$EVTYPE))/nrow(db)), 2), "% of the database.")
## [1] "That represents 1.26% of the database."
Let’s put all the subclassified into your top grouping. For example: EXTREME COLD / WIND CHILL becomes EXTREME COLD only.
db$EVTYPE = gsub("/.*", "", db$EVTYPE)
db$EVTYPE = toupper(db$EVTYPE)
paste0(length(unique(db$EVTYPE)), " unique groups")
## [1] "742 unique groups"
Responding to the first questioning, the events that brought the most destruction are those listed in the table: “The 10 events that caused the greatest destruction.”
inj.fat.total = aggregate(INJURIES + FATALITIES ~ EVTYPE, sum, data = db)
inj.fat.total = inj.fat.total[order(inj.fat.total$`INJURIES + FATALITIES`, decreasing = T), ]
inj.fat.top10 = inj.fat.total[1:10,]
kable(inj.fat.top10, align = "c", caption = "The 10 events that caused the greatest destruction", row.names = F)
| EVTYPE | INJURIES + FATALITIES |
|---|---|
| TORNADO | 96979 |
| EXCESSIVE HEAT | 8428 |
| TSTM WIND | 7561 |
| FLOOD | 7292 |
| LIGHTNING | 6046 |
| HEAT | 3037 |
| FLASH FLOOD | 2769 |
| ICE STORM | 2066 |
| THUNDERSTORM WIND | 1621 |
| WINTER STORM | 1527 |
require(ggplot2)
med1 = median(inj.fat.top10$`INJURIES + FATALITIES`)
col1 = ifelse(inj.fat.top10$`INJURIES + FATALITIES` < med1, "tomato", "red3")
qplot(`INJURIES + FATALITIES`, EVTYPE, data = inj.fat.top10,
main = "INJURIES + FATALITIES", ylab = "",
xlab = "The median suggests a separator for extremely severe cases") +
geom_text(aes(label = `INJURIES + FATALITIES`),
hjust = .5, vjust = -1, col = col1, size = 3) +
geom_vline(xintercept = med1, col = "grey")
Let’s evaluate the economic effects of the weather. Here we have to make some adjustments of the order of magnitude of the variables “PROPDMG” and “CROPDMG”, because it is not with the unit standardized for evaluation. We will use the following conversion rule: h -> hundred, then x100, k -> thousand, then x1000, m -> million, then x1000000, b -> billion, then x1000000000. For the other symbols, because they are unknown, we will use no conversion scale.
db$PROPDMGEXP = toupper(db$PROPDMGEXP)
db$CROPDMGEXP = toupper(db$CROPDMGEXP)
kable(data.frame(table(db$PROPDMGEXP)), align = "c", caption = "Symbols used in PROPDMGEXP", col.names = c("Symbol", "Freq"))
| Symbol | Freq |
|---|---|
| 465934 | |
| - | 1 |
| ? | 8 |
| + | 5 |
| 0 | 216 |
| 1 | 25 |
| 2 | 13 |
| 3 | 4 |
| 4 | 4 |
| 5 | 28 |
| 6 | 4 |
| 7 | 5 |
| 8 | 1 |
| B | 40 |
| H | 7 |
| K | 424665 |
| M | 11337 |
kable(data.frame(table(db$CROPDMGEXP)), align = "c", caption = "Symbols used in CROPDMGEXP", col.names = c("Symbol", "Freq"))
| Symbol | Freq |
|---|---|
| 618413 | |
| ? | 7 |
| 0 | 19 |
| 2 | 1 |
| B | 9 |
| K | 281853 |
| M | 1995 |
conversor = function(x) {
obj = rep(NA, length(x))
obj[x == "H"] = 100
obj[x == "K"] = 1000
obj[x == "M"] = 1000000
obj[x == "B"] = 1000000000
obj[x != "H" & x != "K" & x != "M" & x != "B"] = 1
return(obj)
}
db$PROPDMGEXP2 = conversor(db$PROPDMGEXP)
db$CROPDMGEXP2 = conversor(db$CROPDMGEXP)
db$PROPDMG2 = db$PROPDMG*db$PROPDMGEXP2
db$CROPDMG2 = db$CROPDMG*db$CROPDMGEXP2
The next step is to evaluate the losses generated by PROPDMG2 + CROPDMG2 ~ EVTYPE + BGN_DATE, that is, we will have a time series with the event classifier that occurred. We will restrict the evaluation to only the 10 cases that caused further destruction.
See in the figure “Losses by top 10 cases that caused further destruction” that the cases that bring, historically, more economic impact are: TORNADO, FLOOD and THUNDERSTORM WIND, arriving at the house of millions and billions. Out of curiosity, in the TORNADO class, the 1980s were by far the period of greatest loss.
It follows in the figure: “PROPDMG + CROPDMG (Billions of dollars in the U.S from 1950-2011)” that FLOOD, HURRICANE, TORNADO, STORM SURGE and HAIL are the ones that caused the greatest financial impact on the United States at all times.
suppressWarnings({db$BGN_DATE = lubridate::dmy_hms(db$BGN_DATE)})
eco.ty.ts = aggregate(PROPDMG2 + CROPDMG2 ~ EVTYPE + BGN_DATE, sum, data = db)
eco.ty.ts.top10 = subset(eco.ty.ts, eco.ty.ts$EVTYPE == inj.fat.top10$EVTYPE)
qplot(BGN_DATE, `PROPDMG2 + CROPDMG2`, facets = ~ EVTYPE, data = eco.ty.ts.top10, geom = "line", xlab = "Year", main = "Losses by top 10 cases that caused further\ndestruction in the U.S from 1950-2011")
eco.top10 = aggregate(PROPDMG2 + CROPDMG2 ~ EVTYPE, sum, data = db)
eco.top10 = eco.top10[order(eco.top10$`PROPDMG2 + CROPDMG2`, decreasing = T), ]
eco.top10 = eco.top10[1:10,]
kable(eco.top10, align = "c", caption = "The 10 events that caused the greatest economic losses", row.names = F)
| EVTYPE | PROPDMG2 + CROPDMG2 |
|---|---|
| FLOOD | 150703066266 |
| HURRICANE | 86523941810 |
| TORNADO | 57352114049 |
| STORM SURGE | 47965579000 |
| HAIL | 18758772666 |
| FLASH FLOOD | 17835637173 |
| DROUGHT | 15018677780 |
| RIVER FLOOD | 10148404500 |
| ICE STORM | 8967041360 |
| TROPICAL STORM | 8382236550 |
med2 = median(eco.top10$`PROPDMG2 + CROPDMG2`)
col2 = ifelse(eco.top10$`PROPDMG2 + CROPDMG2` < med1, "tomato", "red3")
lab2 = round(eco.top10$`PROPDMG2 + CROPDMG2`/1000000000, 2)
qplot(`PROPDMG2 + CROPDMG2`, EVTYPE, data = eco.top10,
main = "PROPDMG + CROPDMG\n(billions of dollars in the U.S from 1950-2011)", ylab = "", xlab = "The median suggests a separator for extremely severe cases") +
geom_text(aes(label = lab2),
hjust = .5, vjust = -1, col = col1, size = 3) +
geom_vline(xintercept = med2, col = "grey")