Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

What do we want to know?

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Summary

Data were downloaded according to the job specifications. The original base had to undergo some changes in the EVTYPE variable as there were subclassifications (11379 lines in total) and therefore were tuned to the top rank, resulting in 742 unique groups. Under these conditions the major fatalities and injuries are highlighted for TORNADO, TSTM WIND, LIGHTNING, FLOOD and EXCESSIVE HEAT. To assess the damage, the units were adjusted to the same order of magnitude, concluding that FLOOD, HURRICANE, TORNADO, STORM SURGE and HAIL are the ones that caused the greatest financial impact on the United States at all times. All details as in the tables, graphs and comments that follow in this document.

Data Processing

Let’s load and evaluate the structure of the database and see what we have about the variable types of events (EVTYPE).

# Get data
suppressWarnings({db = data.table::fread(loc)})
## 
Read 4.1% of 967216 rows
Read 23.8% of 967216 rows
Read 38.3% of 967216 rows
Read 51.7% of 967216 rows
Read 61.0% of 967216 rows
Read 74.4% of 967216 rows
Read 80.6% of 967216 rows
Read 87.9% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:10
set.seed(1234)
sample(db$EVTYPE, 20)
##  [1] "HAIL"                    "FLASH FLOOD"            
##  [3] "TSTM WIND"               "MARINE TSTM WIND"       
##  [5] "THUNDERSTORM WIND"       "EXTREME COLD/WIND CHILL"
##  [7] "HAIL"                    "HEAVY SNOW"             
##  [9] "TSTM WIND"               "HAIL"                   
## [11] "HAIL"                    "WINTER STORM"           
## [13] "DROUGHT"                 "FLOOD"                  
## [15] "HEAVY SNOW"              "HAIL"                   
## [17] "HAIL"                    "HAIL"                   
## [19] "TSTM WIND"               "THUNDERSTORM WINDS"

Apparently we have events with subclassifications, so let’s evaluate how many are in this condition.

paste0(sum(grepl("/", db$EVTYPE)), " events with subclassification.")
## [1] "11379 events with subclassification."
paste0("That represents ", round(100*(sum(grepl("/", db$EVTYPE))/nrow(db)), 2), "% of the database.")
## [1] "That represents 1.26% of the database."

Let’s put all the subclassified into your top grouping. For example: EXTREME COLD / WIND CHILL becomes EXTREME COLD only.

db$EVTYPE = gsub("/.*", "", db$EVTYPE)
db$EVTYPE = toupper(db$EVTYPE)
paste0(length(unique(db$EVTYPE)), " unique groups")
## [1] "742 unique groups"

RESULTS - Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Responding to the first questioning, the events that brought the most destruction are those listed in the table: “The 10 events that caused the greatest destruction.”

inj.fat.total = aggregate(INJURIES + FATALITIES ~ EVTYPE, sum, data = db)
inj.fat.total = inj.fat.total[order(inj.fat.total$`INJURIES + FATALITIES`, decreasing = T), ]
inj.fat.top10 = inj.fat.total[1:10,]
kable(inj.fat.top10, align = "c", caption = "The 10 events that caused the greatest destruction", row.names = F)
The 10 events that caused the greatest destruction
EVTYPE INJURIES + FATALITIES
TORNADO 96979
EXCESSIVE HEAT 8428
TSTM WIND 7561
FLOOD 7292
LIGHTNING 6046
HEAT 3037
FLASH FLOOD 2769
ICE STORM 2066
THUNDERSTORM WIND 1621
WINTER STORM 1527
require(ggplot2)

med1 = median(inj.fat.top10$`INJURIES + FATALITIES`)
col1 = ifelse(inj.fat.top10$`INJURIES + FATALITIES` < med1, "tomato", "red3")

qplot(`INJURIES + FATALITIES`, EVTYPE, data = inj.fat.top10,
      main = "INJURIES + FATALITIES", ylab = "",
      xlab = "The median suggests a separator for extremely severe cases") +
    
    geom_text(aes(label = `INJURIES + FATALITIES`), 
              hjust = .5, vjust = -1, col = col1, size = 3) +
    
    geom_vline(xintercept = med1, col = "grey")

RESULTS - Across the United States, which types of events have the greatest economic consequences?

Let’s evaluate the economic effects of the weather. Here we have to make some adjustments of the order of magnitude of the variables “PROPDMG” and “CROPDMG”, because it is not with the unit standardized for evaluation. We will use the following conversion rule: h -> hundred, then x100, k -> thousand, then x1000, m -> million, then x1000000, b -> billion, then x1000000000. For the other symbols, because they are unknown, we will use no conversion scale.

db$PROPDMGEXP = toupper(db$PROPDMGEXP)
db$CROPDMGEXP = toupper(db$CROPDMGEXP)
kable(data.frame(table(db$PROPDMGEXP)), align = "c", caption = "Symbols used in PROPDMGEXP", col.names = c("Symbol", "Freq"))
Symbols used in PROPDMGEXP
Symbol Freq
465934
- 1
? 8
+ 5
0 216
1 25
2 13
3 4
4 4
5 28
6 4
7 5
8 1
B 40
H 7
K 424665
M 11337
kable(data.frame(table(db$CROPDMGEXP)), align = "c", caption = "Symbols used in CROPDMGEXP", col.names = c("Symbol", "Freq"))
Symbols used in CROPDMGEXP
Symbol Freq
618413
? 7
0 19
2 1
B 9
K 281853
M 1995
conversor = function(x) {
        obj = rep(NA, length(x))
        obj[x == "H"] = 100
        obj[x == "K"] = 1000
        obj[x == "M"] = 1000000
        obj[x == "B"] = 1000000000
        obj[x != "H" & x != "K" & x != "M" & x != "B"] = 1
        return(obj)
}
db$PROPDMGEXP2 = conversor(db$PROPDMGEXP) 
db$CROPDMGEXP2 = conversor(db$CROPDMGEXP)
db$PROPDMG2 = db$PROPDMG*db$PROPDMGEXP2
db$CROPDMG2 = db$CROPDMG*db$CROPDMGEXP2

The next step is to evaluate the losses generated by PROPDMG2 + CROPDMG2 ~ EVTYPE + BGN_DATE, that is, we will have a time series with the event classifier that occurred. We will restrict the evaluation to only the 10 cases that caused further destruction.

See in the figure “Losses by top 10 cases that caused further destruction” that the cases that bring, historically, more economic impact are: TORNADO, FLOOD and THUNDERSTORM WIND, arriving at the house of millions and billions. Out of curiosity, in the TORNADO class, the 1980s were by far the period of greatest loss.

It follows in the figure: “PROPDMG + CROPDMG (Billions of dollars in the U.S from 1950-2011)” that FLOOD, HURRICANE, TORNADO, STORM SURGE and HAIL are the ones that caused the greatest financial impact on the United States at all times.

suppressWarnings({db$BGN_DATE = lubridate::dmy_hms(db$BGN_DATE)})
eco.ty.ts = aggregate(PROPDMG2 + CROPDMG2 ~ EVTYPE + BGN_DATE, sum, data = db)
eco.ty.ts.top10 = subset(eco.ty.ts, eco.ty.ts$EVTYPE == inj.fat.top10$EVTYPE)
qplot(BGN_DATE, `PROPDMG2 + CROPDMG2`, facets = ~ EVTYPE, data = eco.ty.ts.top10, geom = "line", xlab = "Year", main = "Losses by top 10 cases that caused further\ndestruction in the U.S from 1950-2011")

eco.top10 = aggregate(PROPDMG2 + CROPDMG2 ~ EVTYPE, sum, data = db)
eco.top10 = eco.top10[order(eco.top10$`PROPDMG2 + CROPDMG2`, decreasing = T), ]
eco.top10 = eco.top10[1:10,]
kable(eco.top10, align = "c", caption = "The 10 events that caused the greatest economic losses", row.names = F)
The 10 events that caused the greatest economic losses
EVTYPE PROPDMG2 + CROPDMG2
FLOOD 150703066266
HURRICANE 86523941810
TORNADO 57352114049
STORM SURGE 47965579000
HAIL 18758772666
FLASH FLOOD 17835637173
DROUGHT 15018677780
RIVER FLOOD 10148404500
ICE STORM 8967041360
TROPICAL STORM 8382236550
med2 = median(eco.top10$`PROPDMG2 + CROPDMG2`)
col2 = ifelse(eco.top10$`PROPDMG2 + CROPDMG2` < med1, "tomato", "red3")
lab2 = round(eco.top10$`PROPDMG2 + CROPDMG2`/1000000000, 2)

qplot(`PROPDMG2 + CROPDMG2`, EVTYPE, data = eco.top10,
      main = "PROPDMG + CROPDMG\n(billions of dollars in the U.S from 1950-2011)", ylab = "", xlab = "The median suggests a separator for extremely severe cases") +
    
    geom_text(aes(label = lab2),
              hjust = .5, vjust = -1, col = col1, size = 3) +
    
    geom_vline(xintercept = med2, col = "grey")