NOAA 1995-2011 Storm Data Analysis

Synopsis

This analysis seeks to answer two main questions: What type of weather event causes the most economic damage? What type of weather event causes the most harm to human health? The dataset in use, from the National Oceanic and Atmospheric Administration (NOAA), is a collection of over 900 thousand storm observations dating back to 1950. Unfortunately, the data is not consistent until 1995, so all entries before then are excluded from the analysis. Five variables help answer the questions: event type, property damage, crop damage, injuries, and fatalities. After categorizing the event types, the data reveals that floods are by far the most destructive to property, topping $200 billion for the years included. The next highest category was hurricane/ocean events, at just under $100 billion. Across the board, damage to agriculture was much lower, never reaching $25 billion. Meanwhile, tornado and wind events topped injuries at over 22,000, while heat and drought events caused the most deaths.

Data Processing

Processing the data for analysis was unfortunately quite lengthy in this case. First, the event type variable is riddled with entry errors and inconsistencies. Second, the dataset is massive and contains far more information than this analysis needs. Because the computer in use has limited memory and processing power, much of the effort goes into paring down the dataset as much as possible. The first step in the code is to load the necessary libraries and read in the data. In this case, the data file is already in the working directory.

library(dplyr)
library(scales)
library(ggplot2)

FullData <- read.csv("StormData.csv")

Next, a quick histogram shows that storm reports are regular from 1995 onwards. All data points before 1995 are removed - that shaves off about 85k observations. To condense the data set, all irrelevant variables are removed. The analysis sums damage numbers, so observations in which the damage is zero (for all four relevant variables) can be removed. Note that these values would still be necessary for, say, examining the frequency of weather events or the average damage from each event type. This analysis is only interested in total sums of damage, so zero values are not relevant.

#only recent, reliable data
FullData[,2] <- as.Date(FullData[,2], "%m/%d/%Y")

hist(FullData[,2], breaks = 60)

#use >= so that events on 1 January 1995 itself are kept ("from 1995 onwards")
RecentData <- FullData[FullData$BGN_DATE >= as.Date("1995-01-01"),]

#remove irrelevant variables
workingData <- RecentData[,c(2, 8, 23:28, 36)]

#remove rows with zero values for all types of damage
#flag the all-zero rows with a logical vector and drop them directly, which avoids a join on every column
empty <- workingData$PROPDMG==0 & workingData$CROPDMG==0 & workingData$INJURIES==0 & workingData$FATALITIES==0
workingData <- workingData[!(empty %in% TRUE),]
rm(empty)
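
The caveat about averages and frequencies can be checked on a toy table. This is a minimal sketch with made-up numbers (the `toy` data frame is hypothetical, not taken from the NOAA file): dropping the zero-damage rows leaves the totals untouched, but it changes the mean and removes event types that never caused damage.

```r
# Toy data (hypothetical values, not from the NOAA file)
toy <- data.frame(
  EVTYPE  = c("FLOOD", "FLOOD", "FLOOD", "HAIL"),
  PROPDMG = c(100, 0, 50, 0),
  stringsAsFactors = FALSE
)

# Drop the zero-damage rows, as the processing step above does
nonzero <- toy[toy$PROPDMG != 0, ]

# Totals per event type, before and after dropping the zero rows
total_all     <- tapply(toy$PROPDMG, toy$EVTYPE, sum)
total_nonzero <- tapply(nonzero$PROPDMG, nonzero$EVTYPE, sum)

# Means per event type, before and after dropping the zero rows
mean_all     <- tapply(toy$PROPDMG, toy$EVTYPE, mean)
mean_nonzero <- tapply(nonzero$PROPDMG, nonzero$EVTYPE, mean)

total_all; total_nonzero
mean_all; mean_nonzero
```

In this toy case the FLOOD total is the same either way, while the FLOOD mean rises once the zero row is gone and HAIL disappears from the filtered table entirely.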

There were over 400 thousand weather events with zero damage, but there is another issue in processing the data: the damage values for property and crops are not all in the same notation. Each has an additional column that gives its order of magnitude: H for hundred, K for thousand, M for million, and B for billion. Although 35K and 6B are easier for humans to read than 35000 and 6000000000, they pose a problem for automated computations. The following code rewrites the damage values so they are all fully written out.

#now I will standardize the propdmg and cropdmg variables according to their EXP values
#for PROPDMG 
hp <- which(workingData$PROPDMGEXP %in% c("H", "h"))
kp <- which(workingData$PROPDMGEXP %in% c("K", "k"))
mp <- which(workingData$PROPDMGEXP %in% c("M", "m"))
bp <- which(workingData$PROPDMGEXP %in% c("B", "b"))

workingData[hp, "PROPDMG"] = workingData[hp, "PROPDMG"]*100
workingData[kp, "PROPDMG"] = workingData[kp, "PROPDMG"]*1000
workingData[mp, "PROPDMG"] = workingData[mp, "PROPDMG"]*1000000
workingData[bp, "PROPDMG"] = workingData[bp, "PROPDMG"]*1000000000

#exact same process but for CROPDMG, there is no CROPDMGEXP %in% "H"
kc <- which(workingData$CROPDMGEXP %in% c("K", "k"))
mc <- which(workingData$CROPDMGEXP %in% c("M", "m"))
bc <- which(workingData$CROPDMGEXP %in% c("B", "b"))

workingData[kc, "CROPDMG"] = workingData[kc, "CROPDMG"]*1000
workingData[mc, "CROPDMG"] = workingData[mc, "CROPDMG"]*1000000
workingData[bc, "CROPDMG"] = workingData[bc, "CROPDMG"]*1000000000

#and remove these now-pointless variables
workingData <- workingData[,-c(6,8)]
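
The same conversion can also be written with a lookup vector instead of one index per letter. The sketch below is an alternative, not the code used for this report; the `toy` data frame and the `multiplier` and `mult` names are made up for illustration. Like the code above, it leaves values with an unrecognized exponent unchanged.

```r
# Hypothetical toy rows; the column names mirror the dataset
toy <- data.frame(
  PROPDMG    = c(35, 6, 2.5, 10, 7),
  PROPDMGEXP = c("K", "B", "m", "h", ""),
  stringsAsFactors = FALSE
)

# Named lookup vector: exponent letter -> multiplier
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier; unrecognized codes come back as NA
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])

# Leave values with an unrecognized exponent unchanged
mult[is.na(mult)] <- 1

toy$PROPDMG <- toy$PROPDMG * mult
toy
```

Using `toupper()` covers the lowercase variants in one step, and the same `multiplier` vector could be reused for CROPDMG.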

Almost done pre-processing! The last part is categorizing the event types. After scanning the data, the following categories were deemed the most useful: winter/ice (includes snow, sleet, extremely low temperatures), heat (drought, extremely high temperatures), fire/volcanoes, earthquake/land events (landslides, tremors, avalanches, etc.), hurricanes and ocean events (including tropical storms, coastal events, and waterspouts), thunderstorms (also rain, hail, fog, lightning), floods (for flooding not included in hurricane or thunderstorm events, plus tsunamis, tides, and surf), and tornado/wind events.

#isolating snow/ice events
ind.winter <- grep("wint|snow|ice|blizzard|icy|cold|freez|frost|sleet|glaze|mix|hypothermia", workingData$EVTYPE, ignore.case = TRUE, value = FALSE)

#heat and dryness events
ind.heat <- grep("high temp|heat|warm|dry|dust|hot|drought|driest", workingData$EVTYPE, ignore.case = TRUE)

#fire and volcanos
ind.fire <- grep("fire|smoke|volcan|erupt", workingData$EVTYPE, ignore.case = TRUE)

#earthquake/landslide events
ind.earth <- grep("earthquake|quake|slide|rock|land|avalanche|mud|erosion", workingData$EVTYPE, ignore.case=TRUE)

#hurricanes and similar events
ind.hurri <- grep("hurricane|typhoon|water spout|waterspout|seas|tropical|coastal", workingData$EVTYPE, ignore.case = TRUE)
#the "DUST DEVIL WATERSPOUT" combo type is unclear what type it should be, so I will leave it with the hurricane and waterspout set

#thunderstorms + other precipitation
ind.tstm <- grep("thunderstorm|tstm|tstorm|lightning|lighting|rain|hail|precip|shower|wet|fog|cloud|burst", workingData$EVTYPE, ignore.case = TRUE)

#next is floods, there is a lot of overlap between this category and the others, e.g. "snow melt flooding"
ind.flood <- grep("flood|water|surf|tid|dam break|tsunami|swell|current|storm surge|wave|stream|marine|fld|seiche|drown", workingData$EVTYPE, ignore.case = TRUE)

#tornado and wind-only events
ind.torn <- grep("tornado|dust devil|wind|gust|funnel", workingData$EVTYPE, ignore.case = TRUE)

workingData <- mutate(workingData, CAT = NA)
workingData[ind.flood, "CAT"] = "Flood"
workingData[ind.torn, "CAT"] = "Tornado/Wind"
workingData[ind.winter, "CAT"] = "Winter/Cold"
workingData[ind.earth, "CAT"] = "Land/Rock"
workingData[ind.fire, "CAT"] = "Fire"
workingData[ind.hurri, "CAT"] = "Hurricane/Ocean"
workingData[ind.tstm, "CAT"] = "Thunderstorms/Precip."
workingData[ind.heat, "CAT"] = "Heat/Drought"

#how many lines are uncategorized?
sum(is.na(workingData$CAT))

## [1] 36
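
Because many event labels match more than one pattern, the order of the assignments above decides which category wins: each later assignment overwrites the earlier ones. The sketch below illustrates this on a few made-up labels with shortened patterns (the `evtype` vector and the one-word patterns are hypothetical, chosen only to show the overwrite behaviour).

```r
# Hypothetical event labels that match more than one pattern
evtype <- c("SNOWMELT FLOODING", "COASTAL FLOOD", "HEAVY RAIN/FLOOD", "RIVER FLOOD")

# Shortened versions of the patterns above, for illustration only
ind.flood  <- grep("flood", evtype, ignore.case = TRUE)
ind.winter <- grep("snow", evtype, ignore.case = TRUE)
ind.hurri  <- grep("coastal", evtype, ignore.case = TRUE)
ind.tstm   <- grep("rain", evtype, ignore.case = TRUE)

# Assign in the same relative order as above: later assignments overwrite earlier ones
category <- rep(NA_character_, length(evtype))
category[ind.flood]  <- "Flood"
category[ind.winter] <- "Winter/Cold"
category[ind.hurri]  <- "Hurricane/Ocean"
category[ind.tstm]   <- "Thunderstorms/Precip."

data.frame(evtype, category)
```

All four labels match the flood pattern, but only "RIVER FLOOD" stays in the Flood category. This is why Flood is assigned first: it acts as the fallback for flooding that no other category claims.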

The processing is completely done! Damage notation is standardized, unnecessary variables removed, and events categorized.

Analysis

The economic damage reports only detail immediate damage from each weather event, e.g. thunderstorm winds felling a tree onto a house. The property damage includes the tree and house damage, but not the knock-on economic effect of the homeowner taking time off work to repair the house. Unfortunately this data does not contain those kinds of long-term observations, so the analysis is based on immediate damage caused directly by weather forces.

The damage to human health is a bit more straightforward, although storms certainly can have long-term health consequences as seen in Puerto Rico in 2017-2018 in the aftermath of Hurricane Maria.

For each question, workingData is used to create a smaller summary table. One table lists the sum of injuries/fatalities per event type (“CAT” for category, in the code), and the other lists the property damage/crop damage sums per event type. These tables are then graphed with a different shape for each variable and a different color for each event type category.

The first analysis asks: which weather type caused the most injuries and fatalities?

#sum the injuries and fatalities per event type
pplsums <- aggregate(cbind(FATALITIES, INJURIES)~CAT, data = workingData, sum)

#graph
g1 <- ggplot(data = pplsums, aes(x = CAT, color = CAT))
plot(g1 +
       geom_point(aes(y = FATALITIES, shape = "Deaths"), size = 3) +
       geom_point(aes(y = INJURIES, shape = "Injuries"), size = 3) +
       theme(axis.text.x = element_blank()) +
       labs(title = "Population Harm by Event Type, 1995-2011") +
       labs(x = "Type", y = "Deaths/Injuries") +
       guides(color = guide_legend(title = NULL)) +
       guides(shape = guide_legend(title = NULL))
     )
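
To make the summing step concrete, here is a small sketch of the same `aggregate()` call on a toy table (the values are made up). It also shows one side effect worth knowing: with the formula interface and its default `na.action`, rows whose CAT is NA are dropped, so the uncategorized rows counted earlier do not contribute to any sum.

```r
# Toy table (hypothetical values); the last row is uncategorized
toy <- data.frame(
  CAT        = c("Flood", "Flood", "Heat/Drought", NA),
  FATALITIES = c(1, 2, 5, 4),
  INJURIES   = c(0, 3, 1, 9),
  stringsAsFactors = FALSE
)

# Same call shape as above: one row per category, one summed column per variable
sums <- aggregate(cbind(FATALITIES, INJURIES) ~ CAT, data = toy, sum)
sums
```

The result has one row per category, with the NA row left out of both columns.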

Now the same process, but for the property and crop damage variables.

#sum damage per event type
propsums <- aggregate(cbind(PROPDMG, CROPDMG)~CAT, data = workingData, sum)

#graph
g2 <- ggplot(data = propsums, aes(x = CAT, color = CAT))
plot(g2 +
       scale_y_continuous(labels=dollar) +
       geom_point(aes(y = PROPDMG, shape = "Property"), size = 3) +
       geom_point(aes(y = CROPDMG, shape = "Agriculture"), size = 3) +
       theme(axis.text.x = element_blank()) +
       labs(title = "Immediate Economic Damage by Event Type, 1995-2011") +
       labs(x = "Type", y = "Damage in Dollars") +
       guides(color = guide_legend(title = NULL)) +
       guides(shape = guide_legend(title = NULL))
      )

Results

Population Harm

These results are particularly interesting for their clear outlier values. In population harm, tornado and wind events caused over 22,500 injuries for the given period - roughly 10,000 more than the next most dangerous type, thunderstorms and other non-winter precipitation. Thankfully the deaths variable is much lower across the board, with heat events (including droughts and extremely high temperatures) topping the list at over 2,500, and floods and tornadoes close behind.

Economic Damage

The chart for economic damage tells a very different story. Floods caused far and away the most property damage, topping $200 billion for the given period. The next most damaging category, hurricanes and other ocean events, didn’t quite reach $100 billion. As in the population data, one variable is drastically lower than the other: agriculture is impacted much less than general property. The anomaly in the heat category, where agricultural damage is higher than property damage, is to be expected given how sensitive agriculture is to temperature and drought. But in no category did agricultural damage reach $25 billion.

Further Questions

Although this data gives insights into the damage done by extreme weather events, more questions need to be answered before crafting a policy to address that damage. For example, tornadoes caused the most injuries - did the tornado events happen in the same region? What type of weather warning system is in that region? Are tornado shelters there legally mandated, or perhaps subsidized by the government? The data tells which types of weather are the most damaging, but much more information is needed to work out how to counteract that damage.