NOAA Storm Data Base - Simple Data Analysis

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

What do we want to know?

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Summary

Data were downloaded according to the job specifications. The original base had to undergo some changes in the EVTYPE variable as there were subclassifications (11379 lines in total) and therefore were tuned to the top rank, resulting in 742 unique groups. Under these conditions the major fatalities and injuries are highlighted for TORNADO, TSTM WIND, LIGHTNING, FLOOD and EXCESSIVE HEAT. To assess the damage, the units were adjusted to the same order of magnitude, concluding that FLOOD, HURRICANE, TORNADO, STORM SURGE and HAIL are the ones that caused the greatest financial impact on the United States at all times. All details as in the tables, graphs and comments that follow in this document.

Data Processing

Let’s load and evaluate the structure of the database and see what we have about the variable types of events (EVTYPE).

# Get data
suppressWarnings({db = data.table::fread(loc)})

## 
Read 4.1% of 967216 rows
Read 23.8% of 967216 rows
Read 38.3% of 967216 rows
Read 51.7% of 967216 rows
Read 61.0% of 967216 rows
Read 74.4% of 967216 rows
Read 80.6% of 967216 rows
Read 87.9% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:10

set.seed(1234)
sample(db$EVTYPE, 20)

##  [1] "HAIL"                    "FLASH FLOOD"            
##  [3] "TSTM WIND"               "MARINE TSTM WIND"       
##  [5] "THUNDERSTORM WIND"       "EXTREME COLD/WIND CHILL"
##  [7] "HAIL"                    "HEAVY SNOW"             
##  [9] "TSTM WIND"               "HAIL"                   
## [11] "HAIL"                    "WINTER STORM"           
## [13] "DROUGHT"                 "FLOOD"                  
## [15] "HEAVY SNOW"              "HAIL"                   
## [17] "HAIL"                    "HAIL"                   
## [19] "TSTM WIND"               "THUNDERSTORM WINDS"

Apparently we have events with subclassifications, so let’s evaluate how many are in this condition.

paste0(sum(grepl("/", db$EVTYPE)), " events with subclassification.")

## [1] "11379 events with subclassification."

paste0("That represents ", round(100*(sum(grepl("/", db$EVTYPE))/nrow(db)), 2), "% of the database.")

## [1] "That represents 1.26% of the database."

Let’s put all the subclassified into your top grouping. For example: EXTREME COLD / WIND CHILL becomes EXTREME COLD only.

db$EVTYPE = gsub("/.*", "", db$EVTYPE)
db$EVTYPE = toupper(db$EVTYPE)
paste0(length(unique(db$EVTYPE)), " unique groups")

## [1] "742 unique groups"

RESULTS - Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Responding to the first questioning, the events that brought the most destruction are those listed in the table: “The 10 events that caused the greatest destruction.”

inj.fat.total = aggregate(INJURIES + FATALITIES ~ EVTYPE, sum, data = db)
inj.fat.total = inj.fat.total[order(inj.fat.total$`INJURIES + FATALITIES`, decreasing = T), ]
inj.fat.top10 = inj.fat.total[1:10,]
kable(inj.fat.top10, align = "c", caption = "The 10 events that caused the greatest destruction", row.names = F)

The 10 events that caused the greatest destruction
EVTYPE	INJURIES + FATALITIES
TORNADO	96979
EXCESSIVE HEAT	8428
TSTM WIND	7561
FLOOD	7292
LIGHTNING	6046
HEAT	3037
FLASH FLOOD	2769
ICE STORM	2066
THUNDERSTORM WIND	1621
WINTER STORM	1527

require(ggplot2)

med1 = median(inj.fat.top10$`INJURIES + FATALITIES`)
col1 = ifelse(inj.fat.top10$`INJURIES + FATALITIES` < med1, "tomato", "red3")

qplot(`INJURIES + FATALITIES`, EVTYPE, data = inj.fat.top10,
      main = "INJURIES + FATALITIES", ylab = "",
      xlab = "The median suggests a separator for extremely severe cases") +
    
    geom_text(aes(label = `INJURIES + FATALITIES`), 
              hjust = .5, vjust = -1, col = col1, size = 3) +
    
    geom_vline(xintercept = med1, col = "grey")

RESULTS - Across the United States, which types of events have the greatest economic consequences?

Let’s evaluate the economic effects of the weather. Here we have to make some adjustments of the order of magnitude of the variables “PROPDMG” and “CROPDMG”, because it is not with the unit standardized for evaluation. We will use the following conversion rule: h -> hundred, then x100, k -> thousand, then x1000, m -> million, then x1000000, b -> billion, then x1000000000. For the other symbols, because they are unknown, we will use no conversion scale.

db$PROPDMGEXP = toupper(db$PROPDMGEXP)
db$CROPDMGEXP = toupper(db$CROPDMGEXP)
kable(data.frame(table(db$PROPDMGEXP)), align = "c", caption = "Symbols used in PROPDMGEXP", col.names = c("Symbol", "Freq"))

Symbols used in PROPDMGEXP
Symbol	Freq
	465934
-	1
?	8
+	5
0	216
1	25
2	13
3	4
4	4
5	28
6	4
7	5
8	1
B	40
H	7
K	424665
M	11337

kable(data.frame(table(db$CROPDMGEXP)), align = "c", caption = "Symbols used in CROPDMGEXP", col.names = c("Symbol", "Freq"))

Symbols used in CROPDMGEXP
Symbol	Freq
	618413
?	7
0	19
2	1
B	9
K	281853
M	1995

conversor = function(x) {
        obj = rep(NA, length(x))
        obj[x == "H"] = 100
        obj[x == "K"] = 1000
        obj[x == "M"] = 1000000
        obj[x == "B"] = 1000000000
        obj[x != "H" & x != "K" & x != "M" & x != "B"] = 1
        return(obj)
}
db$PROPDMGEXP2 = conversor(db$PROPDMGEXP) 
db$CROPDMGEXP2 = conversor(db$CROPDMGEXP)
db$PROPDMG2 = db$PROPDMG*db$PROPDMGEXP2
db$CROPDMG2 = db$CROPDMG*db$CROPDMGEXP2

The next step is to evaluate the losses generated by PROPDMG2 + CROPDMG2 ~ EVTYPE + BGN_DATE, that is, we will have a time series with the event classifier that occurred. We will restrict the evaluation to only the 10 cases that caused further destruction.

See in the figure “Losses by top 10 cases that caused further destruction” that the cases that bring, historically, more economic impact are: TORNADO, FLOOD and THUNDERSTORM WIND, arriving at the house of millions and billions. Out of curiosity, in the TORNADO class, the 1980s were by far the period of greatest loss.

It follows in the figure: “PROPDMG + CROPDMG (Billions of dollars in the U.S from 1950-2011)” that FLOOD, HURRICANE, TORNADO, STORM SURGE and HAIL are the ones that caused the greatest financial impact on the United States at all times.

suppressWarnings({db$BGN_DATE = lubridate::dmy_hms(db$BGN_DATE)})
eco.ty.ts = aggregate(PROPDMG2 + CROPDMG2 ~ EVTYPE + BGN_DATE, sum, data = db)
eco.ty.ts.top10 = subset(eco.ty.ts, eco.ty.ts$EVTYPE == inj.fat.top10$EVTYPE)
qplot(BGN_DATE, `PROPDMG2 + CROPDMG2`, facets = ~ EVTYPE, data = eco.ty.ts.top10, geom = "line", xlab = "Year", main = "Losses by top 10 cases that caused further\ndestruction in the U.S from 1950-2011")

eco.top10 = aggregate(PROPDMG2 + CROPDMG2 ~ EVTYPE, sum, data = db)
eco.top10 = eco.top10[order(eco.top10$`PROPDMG2 + CROPDMG2`, decreasing = T), ]
eco.top10 = eco.top10[1:10,]
kable(eco.top10, align = "c", caption = "The 10 events that caused the greatest economic losses", row.names = F)

The 10 events that caused the greatest economic losses
EVTYPE	PROPDMG2 + CROPDMG2
FLOOD	150703066266
HURRICANE	86523941810
TORNADO	57352114049
STORM SURGE	47965579000
HAIL	18758772666
FLASH FLOOD	17835637173
DROUGHT	15018677780
RIVER FLOOD	10148404500
ICE STORM	8967041360
TROPICAL STORM	8382236550

med2 = median(eco.top10$`PROPDMG2 + CROPDMG2`)
col2 = ifelse(eco.top10$`PROPDMG2 + CROPDMG2` < med1, "tomato", "red3")
lab2 = round(eco.top10$`PROPDMG2 + CROPDMG2`/1000000000, 2)

qplot(`PROPDMG2 + CROPDMG2`, EVTYPE, data = eco.top10,
      main = "PROPDMG + CROPDMG\n(billions of dollars in the U.S from 1950-2011)", ylab = "", xlab = "The median suggests a separator for extremely severe cases") +
    
    geom_text(aes(label = lab2),
              hjust = .5, vjust = -1, col = col1, size = 3) +
    
    geom_vline(xintercept = med2, col = "grey")