Synopsis

This document presents an analysis over the NOAA Storm dataset in order to find the natural events with the highest measured impact in population health and economics of areas affected, in the US. As for data processing I tried to minimize the size of the data frame I’d be working on from the start in order to simplify, while correcting the miscoding of those variables I’d work with (e.g. possible NA values, “?” or “+” in what should be numeric values for the exponent columns). I ended up with a data framed for each variable and ordered them so I could plot them in decreasing order and clearly show the result to the questions placed. The Data Processing section was done with a dynamic point of view in mind, taking the reader to explore the data alongside the code/analysis.

Data Processing

First we load in the dataset and run a quick check.

library(ggplot2)
library(grid)
library(gridExtra)
library(reshape2)
Storms <- read.csv("repdata-data-StormData.csv.bz2")
str(Storms)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The data has 37 variables. Since we’ll be evaluating population health and economic consequences, it’s valuable to know if the table is fully filled in regarding the varibles: FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP. These last two, seem to be exponent values to be applied to their receptive DMG columns. Regarding these, there are several unique values that do not exactly correspond to numbers - example: “?”. However, placing the correct exponent values - according to this link in the exponents columns - , I then check if it is all set.

I’ll first check for NA values in the DMGs columns, FATALITIES and INJURIES. I’ll then subset the table to include just these variables and trasform the exponents column to their correct values. Afterwards, I’ll create 2 news columns for the total value of property (PROPDMGTOT) and crops (CROPDMGTOT) damaged.

sum(is.na(Storms$FATALITIES))
## [1] 0
sum(is.na(Storms$INJURIES))
## [1] 0
sum(is.na(Storms$PROPDMG))
## [1] 0
sum(is.na(Storms$CROPDMG))
## [1] 0
postdata <- Storms[c(8, 23:28)]
sort(unique(postdata$PROPDMGEXP))
##  [1]   - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
sort(unique(postdata$CROPDMGEXP))
## [1]   ? 0 2 B k K m M
## Levels:  ? 0 2 B k K m M
table(postdata$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
postdata$CROPDMGEXP <- as.numeric(postdata$CROPDMGEXP)
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "1" | postdata$CROPDMGEXP == "2"] <- 0
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "3" | postdata$CROPDMGEXP == "4"] <- 1
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "6" | postdata$CROPDMGEXP == "7"] <- 3
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "8" | postdata$CROPDMGEXP == "9"] <- 6
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "5"] <- 9
table(postdata$CROPDMGEXP)
## 
##      0      1      3      6      9 
## 618420     20 281853   1995      9
table(postdata$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
postdata$PROPDMGEXP <- as.numeric(postdata$PROPDMGEXP)
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "1" | postdata$PROPDMGEXP == "2" | 
    postdata$PROPDMGEXP == "3"] <- 0
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "5" | postdata$PROPDMGEXP == "6" | 
    postdata$PROPDMGEXP == "7" | postdata$PROPDMGEXP == "8" | postdata$PROPDMGEXP == 
    "9" | postdata$PROPDMGEXP == "10" | postdata$PROPDMGEXP == "11" | postdata$PROPDMGEXP == 
    "12" | postdata$PROPDMGEXP == "13" | postdata$PROPDMGEXP == "4"] <- 1
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "15" | postdata$PROPDMGEXP == "16"] <- 2
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "17"] <- 3
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "18" | postdata$PROPDMGEXP == "19"] <- 6
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "14"] <- 9
table(postdata$PROPDMGEXP)
## 
##      0      1      2      3      6      9 
## 465943    305      7 424665  11337     40
postdata$PROPDMGTOT <- postdata$PROPDMG * 10^postdata$PROPDMGEXP
postdata$CROPDMGTOT <- postdata$CROPDMG * 10^postdata$CROPDMGEXP

Also, checking upon the EVTYPE variable and its various levels, it is clear that it needs some processing as Wind is coded as both WIND and WND, for example. Floods are also subcodedamong others, which is going disperse our data and affect its conclusions. I’ll then create a new variable called EVENT, with “OTHER” as the default for those categories that do not fit into the general ones, or have significantly lesser presence. Furthermore, I’ll broadly categorize EVTYPE into EVENT, joining together apparently equal levels. I’ll only show the 25 most common ones, for convenience sake.

sort(table(postdata$EVTYPE), decreasing = TRUE)[1:25]
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE                 BLIZZARD 
##                     3392                     2761                     2719 
##                  DROUGHT                ICE STORM           EXCESSIVE HEAT 
##                     2488                     2006                     1678 
##               HIGH WINDS 
##                     1533
postdata$EVENT <- "OTHER"
postdata$EVENT[grep("HAIL", postdata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
postdata$EVENT[grep("FLOOD", postdata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
postdata$EVENT[grep("HEAT", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("WARMTH", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("HOT", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("WARM", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("WIND", postdata$EVTYPE, ignore.case = TRUE)] <- "WIND"
postdata$EVENT[grep("WND", postdata$EVTYPE, ignore.case = TRUE)] <- "WIND"
postdata$EVENT[grep("WINTER", postdata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
postdata$EVENT[grep("STORM", postdata$EVTYPE, ignore.case = TRUE)] <- "STORM"
postdata$EVENT[grep("SNOW", postdata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
postdata$EVENT[grep("TORNADO", postdata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
postdata$EVENT[grep("BLIZZARD", postdata$EVTYPE, ignore.case = TRUE)] <- "BLIZZARD"
postdata$EVENT[grep("LIGHTNING", postdata$EVTYPE, ignore.case = TRUE)] <- "LIGHTNING"
postdata$EVENT[grep("RAIN", postdata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
postdata$EVENT[grep("CLOUD", postdata$EVTYPE, ignore.case = TRUE)] <- "CLOUD"
postdata$EVENT[grep("WATERSPOUT", postdata$EVTYPE, ignore.case = TRUE)] <- "WATERSPOUT"
postdata$EVENT[grep("FIRE", postdata$EVTYPE, ignore.case = TRUE)] <- "FIRE"
postdata$EVENT[grep("DROUGHT", postdata$EVTYPE, ignore.case = TRUE)] <- "DROUGHT"
postdata$EVENT[grep("HURRICANE", postdata$EVTYPE, ignore.case = TRUE)] <- "HURRICANE/TYPHOON"
postdata$EVENT[grep("TYPHOON", postdata$EVTYPE, ignore.case = TRUE)] <- "HURRICANE/TYPHOON"
postdata$EVENT[grep("SURF", postdata$EVTYPE, ignore.case = TRUE)] <- "SURF"
postdata$EVENT[grep("FOG", postdata$EVTYPE, ignore.case = TRUE)] <- "FOG"
postdata$EVENT[grep("COLD", postdata$EVTYPE, ignore.case = TRUE)] <- "COLD"
postdata$EVENT[grep("COOL", postdata$EVTYPE, ignore.case = TRUE)] <- "COLD"
postdata$EVENT[grep("ICE", postdata$EVTYPE, ignore.case = TRUE)] <- "ICE"
postdata$EVENT[grep("ICY", postdata$EVTYPE, ignore.case = TRUE)] <- "ICE"
postdata$EVENT[grep("FROST", postdata$EVTYPE, ignore.case = TRUE)] <- "FREEZE"
postdata$EVENT[grep("FREEZE", postdata$EVTYPE, ignore.case = TRUE)] <- "FREEZE"
postdata$EVENT[grep("TIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "TIDE"
postdata$EVENT[grep("CURRENT", postdata$EVTYPE, ignore.case = TRUE)] <- "TIDE"
postdata$EVENT[grep("STREAM", postdata$EVTYPE, ignore.case = TRUE)] <- "STREAM"
postdata$EVENT[grep("PRECIPITATION", postdata$EVTYPE, ignore.case = TRUE)] <- "PRECIPITATION"
postdata$EVENT[grep("MIXED PRECIP", postdata$EVTYPE, ignore.case = TRUE)] <- "PRECIPITATION"
postdata$EVENT[grep("DRIZZLE", postdata$EVTYPE, ignore.case = TRUE)] <- "PRECIPITATION"
postdata$EVENT[grep("AVALANCHE", postdata$EVTYPE, ignore.case = TRUE)] <- "AVALANCHE"
postdata$EVENT[grep("LANDSLIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "LANDSLIDE"
postdata$EVENT[grep("SMOKE", postdata$EVTYPE, ignore.case = TRUE)] <- "SMOKE"
postdata$EVENT[grep("GLAZE", postdata$EVTYPE, ignore.case = TRUE)] <- "GLAZE"
postdata$EVENT[grep("DUST", postdata$EVTYPE, ignore.case = TRUE)] <- "DUST"
postdata$EVENT[grep("MUDSLIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "MUDSLIDE"
postdata$EVENT[grep("MUD SLIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "MUDSLIDE"
postdata$EVENT[grep("WET", postdata$EVTYPE, ignore.case = TRUE)] <- "WET"
postdata$EVENT[grep("DRY", postdata$EVTYPE, ignore.case = TRUE)] <- "DRY"
postdata$EVENT[grep("TSUNAMI", postdata$EVTYPE, ignore.case = TRUE)] <- "TSUNAMI"
postdata$EVENT[grep("VOLCANIC ASH", postdata$EVTYPE, ignore.case = TRUE)] <- "VOLCANIC ASH"
others <- sum(postdata$EVENT == "OTHER")

I took the time of going through all of the variables and try to categorize them all, since rare events such as tsunamis (20 ocurrences) may have a much greater impact in population health and damages than others that occur much more frequently but generally with a lesser severity. Past this categorisation, 634 entries alone remain classified as OTHER.

Afterwars, I created 4 tables, aggregates per event, relating to the questions proposed. I also created yet another table representing an interset of the

aggregatepropdamage <- aggregate(PROPDMGTOT ~ EVENT, postdata, FUN = sum)
aggregatecropdamage <- aggregate(CROPDMGTOT ~ EVENT, postdata, FUN = sum)
aggregatefatalities <- aggregate(FATALITIES ~ EVENT, postdata, FUN = sum)
aggregateinjuries <- aggregate(INJURIES ~ EVENT, postdata, FUN = sum)
aggregatepropdamage <- aggregatepropdamage[order(aggregatepropdamage$PROPDMGTOT, 
    decreasing = T), ]
aggregatecropdamage <- aggregatecropdamage[order(aggregatecropdamage$CROPDMGTOT, 
    decreasing = T), ]
aggregatefatalities <- aggregatefatalities[order(aggregatefatalities$FATALITIES, 
    decreasing = T), ]
aggregateinjuries <- aggregateinjuries[order(aggregateinjuries$INJURIES, decreasing = T), 
    ]
dmg <- merge(head(aggregatecropdamage, 7), head(aggregatepropdamage, 7), by = "EVENT", 
    all = TRUE)

Results

It is therefore time to answer the questions.

Which event is the most harmful for population health?

Since I ordered all tables by decreasing order, I’ll plot the graph with only the most substantial events for each of the variables we’d like to study.

g1 <- ggplot(data = head(aggregatefatalities, 5)) + aes(x = reorder(EVENT, FATALITIES), 
    y = FATALITIES, fill = EVENT) + geom_bar(stat = "sum") + coord_flip() + 
    theme(legend.position = "none") + xlab("Event") + ylab("Fatalities") + ggtitle("Total fatalities per event")
g2 <- ggplot(data = head(aggregateinjuries, 5)) + aes(x = reorder(EVENT, INJURIES), 
    y = INJURIES, fill = EVENT) + geom_bar(stat = "sum") + coord_flip() + theme(legend.position = "none") + 
    xlab("Event") + ylab("Injuries") + ggtitle("Total injuries per event")
grid.arrange(g1, g2, nrow = 2)
Population health impact of natural events

Population health impact of natural events

The answer seems to be pretty straightforward in both fatalities and injuries: Tornadoes.

Which event has the most direct economic impact?

processeddamage <- melt(dmg, id.vars = c("EVENT"))
processeddamage$variable <- as.numeric(processeddamage$variable)
processeddamage$variable[processeddamage$variable == "1"] <- "Crop Damage"
processeddamage$variable[processeddamage$variable == "2"] <- "Property Damage"
gg1 <- ggplot(processeddamage, aes(x = reorder(EVENT, value), y = value, fill = variable)) + 
    geom_bar(stat = "identity", position = "dodge") + coord_flip() + labs(x = "Event", 
    y = "Total Damage in USD", title = "Total Damage in USD per Event") + theme(plot.title = element_text(hjust = 0.5))
gg1
## Warning: Removed 6 rows containing missing values (geom_bar).
Financial impact of natural events

Financial impact of natural events

options(scipen = 10)
maxprop <- max(postdata$PROPDMGTOT)
maxcropflood <- max(which.max(postdata$PROPDMGTOT))
totalflooddmg <- maxprop + maxcropflood

The answer seems to be Floods with USD115000000000 and USD605953 in property and crop damages, respectively, totalling of USD115000605953 in damages.