This document presents an analysis over the NOAA Storm dataset in order to find the natural events with the highest measured impact in population health and economics of areas affected, in the US. As for data processing I tried to minimize the size of the data frame I’d be working on from the start in order to simplify, while correcting the miscoding of those variables I’d work with (e.g. possible NA values, “?” or “+” in what should be numeric values for the exponent columns). I ended up with a data framed for each variable and ordered them so I could plot them in decreasing order and clearly show the result to the questions placed. The Data Processing section was done with a dynamic point of view in mind, taking the reader to explore the data alongside the code/analysis.
First we load in the dataset and run a quick check.
library(ggplot2)
library(grid)
library(gridExtra)
library(reshape2)
Storms <- read.csv("repdata-data-StormData.csv.bz2")
str(Storms)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The data has 37 variables. Since we’ll be evaluating population health and economic consequences, it’s valuable to know if the table is fully filled in regarding the varibles: FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP. These last two, seem to be exponent values to be applied to their receptive DMG columns. Regarding these, there are several unique values that do not exactly correspond to numbers - example: “?”. However, placing the correct exponent values - according to this link in the exponents columns - , I then check if it is all set.
I’ll first check for NA values in the DMGs columns, FATALITIES and INJURIES. I’ll then subset the table to include just these variables and trasform the exponents column to their correct values. Afterwards, I’ll create 2 news columns for the total value of property (PROPDMGTOT) and crops (CROPDMGTOT) damaged.
sum(is.na(Storms$FATALITIES))
## [1] 0
sum(is.na(Storms$INJURIES))
## [1] 0
sum(is.na(Storms$PROPDMG))
## [1] 0
sum(is.na(Storms$CROPDMG))
## [1] 0
postdata <- Storms[c(8, 23:28)]
sort(unique(postdata$PROPDMGEXP))
## [1] - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
sort(unique(postdata$CROPDMGEXP))
## [1] ? 0 2 B k K m M
## Levels: ? 0 2 B k K m M
table(postdata$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
postdata$CROPDMGEXP <- as.numeric(postdata$CROPDMGEXP)
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "1" | postdata$CROPDMGEXP == "2"] <- 0
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "3" | postdata$CROPDMGEXP == "4"] <- 1
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "6" | postdata$CROPDMGEXP == "7"] <- 3
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "8" | postdata$CROPDMGEXP == "9"] <- 6
postdata$CROPDMGEXP[postdata$CROPDMGEXP == "5"] <- 9
table(postdata$CROPDMGEXP)
##
## 0 1 3 6 9
## 618420 20 281853 1995 9
table(postdata$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
postdata$PROPDMGEXP <- as.numeric(postdata$PROPDMGEXP)
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "1" | postdata$PROPDMGEXP == "2" |
postdata$PROPDMGEXP == "3"] <- 0
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "5" | postdata$PROPDMGEXP == "6" |
postdata$PROPDMGEXP == "7" | postdata$PROPDMGEXP == "8" | postdata$PROPDMGEXP ==
"9" | postdata$PROPDMGEXP == "10" | postdata$PROPDMGEXP == "11" | postdata$PROPDMGEXP ==
"12" | postdata$PROPDMGEXP == "13" | postdata$PROPDMGEXP == "4"] <- 1
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "15" | postdata$PROPDMGEXP == "16"] <- 2
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "17"] <- 3
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "18" | postdata$PROPDMGEXP == "19"] <- 6
postdata$PROPDMGEXP[postdata$PROPDMGEXP == "14"] <- 9
table(postdata$PROPDMGEXP)
##
## 0 1 2 3 6 9
## 465943 305 7 424665 11337 40
postdata$PROPDMGTOT <- postdata$PROPDMG * 10^postdata$PROPDMGEXP
postdata$CROPDMGTOT <- postdata$CROPDMG * 10^postdata$CROPDMGEXP
Also, checking upon the EVTYPE variable and its various levels, it is clear that it needs some processing as Wind is coded as both WIND and WND, for example. Floods are also subcodedamong others, which is going disperse our data and affect its conclusions. I’ll then create a new variable called EVENT, with “OTHER” as the default for those categories that do not fit into the general ones, or have significantly lesser presence. Furthermore, I’ll broadly categorize EVTYPE into EVENT, joining together apparently equal levels. I’ll only show the 25 most common ones, for convenience sake.
sort(table(postdata$EVTYPE), decreasing = TRUE)[1:25]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS
## 1533
postdata$EVENT <- "OTHER"
postdata$EVENT[grep("HAIL", postdata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
postdata$EVENT[grep("FLOOD", postdata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
postdata$EVENT[grep("HEAT", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("WARMTH", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("HOT", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("WARM", postdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
postdata$EVENT[grep("WIND", postdata$EVTYPE, ignore.case = TRUE)] <- "WIND"
postdata$EVENT[grep("WND", postdata$EVTYPE, ignore.case = TRUE)] <- "WIND"
postdata$EVENT[grep("WINTER", postdata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
postdata$EVENT[grep("STORM", postdata$EVTYPE, ignore.case = TRUE)] <- "STORM"
postdata$EVENT[grep("SNOW", postdata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
postdata$EVENT[grep("TORNADO", postdata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
postdata$EVENT[grep("BLIZZARD", postdata$EVTYPE, ignore.case = TRUE)] <- "BLIZZARD"
postdata$EVENT[grep("LIGHTNING", postdata$EVTYPE, ignore.case = TRUE)] <- "LIGHTNING"
postdata$EVENT[grep("RAIN", postdata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
postdata$EVENT[grep("CLOUD", postdata$EVTYPE, ignore.case = TRUE)] <- "CLOUD"
postdata$EVENT[grep("WATERSPOUT", postdata$EVTYPE, ignore.case = TRUE)] <- "WATERSPOUT"
postdata$EVENT[grep("FIRE", postdata$EVTYPE, ignore.case = TRUE)] <- "FIRE"
postdata$EVENT[grep("DROUGHT", postdata$EVTYPE, ignore.case = TRUE)] <- "DROUGHT"
postdata$EVENT[grep("HURRICANE", postdata$EVTYPE, ignore.case = TRUE)] <- "HURRICANE/TYPHOON"
postdata$EVENT[grep("TYPHOON", postdata$EVTYPE, ignore.case = TRUE)] <- "HURRICANE/TYPHOON"
postdata$EVENT[grep("SURF", postdata$EVTYPE, ignore.case = TRUE)] <- "SURF"
postdata$EVENT[grep("FOG", postdata$EVTYPE, ignore.case = TRUE)] <- "FOG"
postdata$EVENT[grep("COLD", postdata$EVTYPE, ignore.case = TRUE)] <- "COLD"
postdata$EVENT[grep("COOL", postdata$EVTYPE, ignore.case = TRUE)] <- "COLD"
postdata$EVENT[grep("ICE", postdata$EVTYPE, ignore.case = TRUE)] <- "ICE"
postdata$EVENT[grep("ICY", postdata$EVTYPE, ignore.case = TRUE)] <- "ICE"
postdata$EVENT[grep("FROST", postdata$EVTYPE, ignore.case = TRUE)] <- "FREEZE"
postdata$EVENT[grep("FREEZE", postdata$EVTYPE, ignore.case = TRUE)] <- "FREEZE"
postdata$EVENT[grep("TIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "TIDE"
postdata$EVENT[grep("CURRENT", postdata$EVTYPE, ignore.case = TRUE)] <- "TIDE"
postdata$EVENT[grep("STREAM", postdata$EVTYPE, ignore.case = TRUE)] <- "STREAM"
postdata$EVENT[grep("PRECIPITATION", postdata$EVTYPE, ignore.case = TRUE)] <- "PRECIPITATION"
postdata$EVENT[grep("MIXED PRECIP", postdata$EVTYPE, ignore.case = TRUE)] <- "PRECIPITATION"
postdata$EVENT[grep("DRIZZLE", postdata$EVTYPE, ignore.case = TRUE)] <- "PRECIPITATION"
postdata$EVENT[grep("AVALANCHE", postdata$EVTYPE, ignore.case = TRUE)] <- "AVALANCHE"
postdata$EVENT[grep("LANDSLIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "LANDSLIDE"
postdata$EVENT[grep("SMOKE", postdata$EVTYPE, ignore.case = TRUE)] <- "SMOKE"
postdata$EVENT[grep("GLAZE", postdata$EVTYPE, ignore.case = TRUE)] <- "GLAZE"
postdata$EVENT[grep("DUST", postdata$EVTYPE, ignore.case = TRUE)] <- "DUST"
postdata$EVENT[grep("MUDSLIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "MUDSLIDE"
postdata$EVENT[grep("MUD SLIDE", postdata$EVTYPE, ignore.case = TRUE)] <- "MUDSLIDE"
postdata$EVENT[grep("WET", postdata$EVTYPE, ignore.case = TRUE)] <- "WET"
postdata$EVENT[grep("DRY", postdata$EVTYPE, ignore.case = TRUE)] <- "DRY"
postdata$EVENT[grep("TSUNAMI", postdata$EVTYPE, ignore.case = TRUE)] <- "TSUNAMI"
postdata$EVENT[grep("VOLCANIC ASH", postdata$EVTYPE, ignore.case = TRUE)] <- "VOLCANIC ASH"
others <- sum(postdata$EVENT == "OTHER")
I took the time of going through all of the variables and try to categorize them all, since rare events such as tsunamis (20 ocurrences) may have a much greater impact in population health and damages than others that occur much more frequently but generally with a lesser severity. Past this categorisation, 634 entries alone remain classified as OTHER.
Afterwars, I created 4 tables, aggregates per event, relating to the questions proposed. I also created yet another table representing an interset of the
aggregatepropdamage <- aggregate(PROPDMGTOT ~ EVENT, postdata, FUN = sum)
aggregatecropdamage <- aggregate(CROPDMGTOT ~ EVENT, postdata, FUN = sum)
aggregatefatalities <- aggregate(FATALITIES ~ EVENT, postdata, FUN = sum)
aggregateinjuries <- aggregate(INJURIES ~ EVENT, postdata, FUN = sum)
aggregatepropdamage <- aggregatepropdamage[order(aggregatepropdamage$PROPDMGTOT,
decreasing = T), ]
aggregatecropdamage <- aggregatecropdamage[order(aggregatecropdamage$CROPDMGTOT,
decreasing = T), ]
aggregatefatalities <- aggregatefatalities[order(aggregatefatalities$FATALITIES,
decreasing = T), ]
aggregateinjuries <- aggregateinjuries[order(aggregateinjuries$INJURIES, decreasing = T),
]
dmg <- merge(head(aggregatecropdamage, 7), head(aggregatepropdamage, 7), by = "EVENT",
all = TRUE)
It is therefore time to answer the questions.
Since I ordered all tables by decreasing order, I’ll plot the graph with only the most substantial events for each of the variables we’d like to study.
g1 <- ggplot(data = head(aggregatefatalities, 5)) + aes(x = reorder(EVENT, FATALITIES),
y = FATALITIES, fill = EVENT) + geom_bar(stat = "sum") + coord_flip() +
theme(legend.position = "none") + xlab("Event") + ylab("Fatalities") + ggtitle("Total fatalities per event")
g2 <- ggplot(data = head(aggregateinjuries, 5)) + aes(x = reorder(EVENT, INJURIES),
y = INJURIES, fill = EVENT) + geom_bar(stat = "sum") + coord_flip() + theme(legend.position = "none") +
xlab("Event") + ylab("Injuries") + ggtitle("Total injuries per event")
grid.arrange(g1, g2, nrow = 2)
Population health impact of natural events
The answer seems to be pretty straightforward in both fatalities and injuries: Tornadoes.
processeddamage <- melt(dmg, id.vars = c("EVENT"))
processeddamage$variable <- as.numeric(processeddamage$variable)
processeddamage$variable[processeddamage$variable == "1"] <- "Crop Damage"
processeddamage$variable[processeddamage$variable == "2"] <- "Property Damage"
gg1 <- ggplot(processeddamage, aes(x = reorder(EVENT, value), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") + coord_flip() + labs(x = "Event",
y = "Total Damage in USD", title = "Total Damage in USD per Event") + theme(plot.title = element_text(hjust = 0.5))
gg1
## Warning: Removed 6 rows containing missing values (geom_bar).
Financial impact of natural events
options(scipen = 10)
maxprop <- max(postdata$PROPDMGTOT)
maxcropflood <- max(which.max(postdata$PROPDMGTOT))
totalflooddmg <- maxprop + maxcropflood
The answer seems to be Floods with USD115000000000 and USD605953 in property and crop damages, respectively, totalling of USD115000605953 in damages.