For this study the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is used. With this its possibel to examine the effect that severe weather events have in the U.S. The main goal will be to show which weather events that are most harmfull to the population, and which events that have the greatest economic consequences. There will also be a short notice, on which subjects could benefit from further analysis. The data processing is included for reproducibility, so feel free to skip to the results.
# packages used
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## De følgende objekter er maskerede fra 'package:stats':
##
## filter, lag
##
## De følgende objekter er maskerede fra 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
library(reshape2)
library(ggplot2)
# read the NOAA data
storm.data <- read_csv("repdata-data-StormData.csv.bz2")
The purpose of this study is to analyse the most harmful events, and the economic consequences, so the data gets minimized to the variables that will help in that study.
storm.data <- storm.data %>% select(one_of("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
storm.data$BGN_DATE <- as.Date(storm.data$BGN_DATE, "%m/%d/%Y")
# show histogram
hist(year(storm.data$BGN_DATE), xlab="year", main="Recorded weather events", breaks=60)
The NOAA database contains data back to 1950, but in the histogram we can see that the frequency that events gets recorded has increased significantly during the 90s. With that in mind and to lessen inflations effect on the economic part of the analysis we will look on the data from 1990 until november 2011.
storm.data <- storm.data %>% filter(year(storm.data$BGN_DATE)>=1990)
# analysis of event types
events <- as.data.frame(table(storm.data$EVTYPE))
events <- events %>% arrange(desc(Freq)) %>% mutate(Event=tolower(Var1), Var1=NULL) %>%
select(one_of(c("Event", "Freq")))
str(events)
## 'data.frame': 977 obs. of 2 variables:
## $ Event: chr "hail" "tstm wind" "thunderstorm wind" "flash flood" ...
## $ Freq : int 240945 147991 82563 54278 29764 25326 20843 20212 15755 15708 ...
head(events, 20)
## Event Freq
## 1 hail 240945
## 2 tstm wind 147991
## 3 thunderstorm wind 82563
## 4 flash flood 54278
## 5 tornado 29764
## 6 flood 25326
## 7 thunderstorm winds 20843
## 8 high wind 20212
## 9 lightning 15755
## 10 heavy snow 15708
## 11 heavy rain 11723
## 12 winter storm 11433
## 13 winter weather 7026
## 14 funnel cloud 6839
## 15 marine tstm wind 6175
## 16 marine thunderstorm wind 5812
## 17 waterspout 3797
## 18 strong wind 3566
## 19 urban/sml stream fld 3392
## 20 wildfire 2761
NOAA has defined 48 different event types, but the amount of registered event types is over 900. We will try to categorize the over 900 different registered types. This categorization will be an approcimation, where we try to atleast make sure that the most common registered events will be properly categorized, and also try to group the types further than what NOAA do.
storm.data <- mutate(storm.data, EVTYPE=tolower(EVTYPE))
storm.data$EVTYPE[grepl("hail", storm.data$EVTYPE)] <- "Hail"
storm.data$EVTYPE[grepl("tstm|thunder", storm.data$EVTYPE)] <- "Thunderstorm"
storm.data$EVTYPE[grepl("flash|flood|fld", storm.data$EVTYPE)] <- "Flood"
storm.data$EVTYPE[grepl("tornado|funnel|waterspout", storm.data$EVTYPE)] <- "Tornado"
storm.data$EVTYPE[grepl("light", storm.data$EVTYPE)] <- "Lightning"
storm.data$EVTYPE[grepl("snow|ice|blizzard|sleet|icy", storm.data$EVTYPE)] <- "Ice Storm"
storm.data$EVTYPE[grepl("rain|shower|precip", storm.data$EVTYPE)] <- "Rain"
storm.data$EVTYPE[grepl("wint|freez|cool|frost|cold|chill|glaze", storm.data$EVTYPE)] <- "Cold"
storm.data$EVTYPE[grepl("wind", storm.data$EVTYPE)] <- "Wind"
storm.data$EVTYPE[grepl("fire", storm.data$EVTYPE)] <- "Fire"
storm.data$EVTYPE[grepl("drought|dry", storm.data$EVTYPE)] <- "Drought"
storm.data$EVTYPE[grepl("heat|hot|warm|record", storm.data$EVTYPE)] <- "Heat"
storm.data$EVTYPE[grepl("fog", storm.data$EVTYPE)] <- "Fog"
storm.data$EVTYPE[grepl("wave|tide|surf|rip current|storm surge|tsunami", storm.data$EVTYPE)] <- "Wave & Tide"
storm.data$EVTYPE[grepl("hurricane|typhoon|tropical", storm.data$EVTYPE)] <- "Hurricane"
storm.data$EVTYPE[grepl("slide", storm.data$EVTYPE)] <- "Landslide"
storm.data$EVTYPE[grepl("dust", storm.data$EVTYPE)] <- "Dust Storm"
storm.data$EVTYPE[grepl("smoke", storm.data$EVTYPE)] <- "Smoke"
# some of the events are clearly not registered correctly by their event type, so they are removed.
storm.data <- storm.data[!grepl("summary", storm.data$EVTYPE),]
storm.data <- storm.data[!grepl("other", storm.data$EVTYPE),]
storm.data <- storm.data[!grepl("\\?", storm.data$EVTYPE),]
By grouping the data on the event types, and looking at fatalities and injuries, we can see which events are the most harmfull to the population.
harm.events <- storm.data %>% group_by(event=EVTYPE) %>%
summarise(tot.fat=sum(FATALITIES), tot.inj=sum(INJURIES)) %>%
arrange(desc(tot.fat+tot.inj)) %>%
head(10) %>% melt(id=1)
colnames(harm.events) <- c("event", "harm.type", "value")
harm.events$event <- reorder(harm.events$event, harm.events$value)
ggplot(data=harm.events, aes(x=event, y=value, fill=harm.type)) +
geom_bar(stat="identity", position="stack") + ylab("casualties") +
coord_flip() + ggtitle("Harmfull events to the population 1990 - 2011") +
scale_fill_discrete(name="legend", labels=c("fatalities", "injuries"))
The graphic shows that in the period 1990-2011 tornados have been the most harmfull, when you look at the amount of total casualties. But more importantly, we can also see that heat has caused the highest amount of fatalities.
In this part we will look at the variables that register the estimated expenses in property damage and crop damage
# to calculate the economic effect we have to combine the prop/crop-dmg with the multiplier in prop/crop-dmgexp
# a helpful function
calc <- function(dmg, exp){
if (is.numeric(exp)==TRUE)
expo <- as.numeric(exp)
if (exp == "h")
expo <- 2
else if (exp[1] == "k")
expo <- 3
else if (exp[1] == "m")
expo <- 6
else if (exp[1] == "b")
expo <- 9
else
expo <- 0
dmg * 10^expo
}
# clean exp variables
storm.data$PROPDMGEXP <- tolower(storm.data$PROPDMGEXP)
storm.data$CROPDMGEXP <- tolower(storm.data$CROPDMGEXP)
storm.data$PROPDMGEXP[is.na(storm.data$PROPDMGEXP)] <- 0
storm.data$CROPDMGEXP[is.na(storm.data$CROPDMGEXP)] <- 0
# add calculated expenses to the df
storm.data <- mutate(storm.data, prop.dmg = mapply(calc, PROPDMG, PROPDMGEXP))
storm.data <- mutate(storm.data, crop.dmg = mapply(calc, CROPDMG, CROPDMGEXP))
# calculate and show the 10 most expensive weather events
tot.dmg <- storm.data %>% group_by(EVTYPE) %>% summarise(total=sum(prop.dmg+crop.dmg)) %>% arrange(desc(total))
tot.dmg10 <- tot.dmg[1:10,]
tot.dmg10
## Source: local data frame [10 x 2]
##
## EVTYPE total
## (chr) (dbl)
## 1 Flood 179976092782
## 2 Hurricane 99173551360
## 3 Wave & Tide 48221044150
## 4 Tornado 30939173679
## 5 Hail 20734204440
## 6 Drought 15025675380
## 7 Thunderstorm 12138241370
## 8 Ice Storm 10914748662
## 9 Cold 10528439651
## 10 Fire 8899910130
We can clearly see that in the period 1990-2011 that flooding has caused the highest amout of expenses, followed by hurricanes and wave & tide events. Next We will visualize the data by year.
# sum the expenses by year
tot.dmg.year <- storm.data %>% filter(EVTYPE %in% tot.dmg10$EVTYPE) %>%
group_by(year=year(BGN_DATE), EVTYPE) %>%
summarise(total=sum(prop.dmg+crop.dmg))
# visualize in a lineplot by event over the years 1990-2011
ggplot(data=tot.dmg.year, aes(x=year, y=total/1000000, colour=EVTYPE)) + geom_line() +
ylab("total expenses in millions") + ggtitle("Total expenses by year and event") +
scale_color_discrete(name="legend")
The graphic shows that the economic consequences is very spiky with Flood spike in 2006, and smaller Hurricane and Wave & Tide spike in the year before. With spiky events that will have expenses far exceeding what you can cointain in one budget year we have to look for other solutions. Probably have to save up in funds, waiting to be used in years with excesive spikes.
Suggested topics for further study :
Include the lat, long from the original data to plot on a map, to combine events and regions.
Split data into statewise analysis.
Add analysis on month, to study seasonal events.
Enhance data with inflation data to eliminate yearly inflation