The NOAA database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. In this project we explore the NOAA Storm Database and answer a few basic questions about severe weather events. In particular, we are interested in the following: which types of events are most harmful with respect to population health across the United States, and which types of events have the greatest economic consequences across the United States. We find that, over the period from 1993 through 2011, extreme temperature events resulted in the most fatalities, Illinois being the state that suffered the most (998 fatalities). Texas suffered from the greatest number of flood events, resulting in the most number of injuries (6,951). The state of California suffered the most property damages, mainly from flood events ($117.4 billion).
This document is organized as follows
R, and can be time consuming. (The zipped database is about 49 Mb and unzips to about one-half Gb csv file).EVTYPE) - 985 categories. We group these events into 5 major categories: Convection, Extreme Temperatures, Flood, Winter, and Other. The rationale for choosing these categories is drawn from this document.The data (in bz2 zipped format) were downloaded from this url and stored locally as a csv (comma-separated values) file. The following piece of R code accomplishes this:
if (!file.exists("data")) dir.create("data")
if (!file.exists("./data/stormData.csv")) {
temp <- tempfile()
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, temp)
data <- read.csv(bzfile(temp))
write.csv(data, file = "./data/stormData.csv", row.names = FALSE)
unlink(temp)
}
We read the file into R and check the first few rows (of 902,297 total rows).
Of the total of 37 variables only a handful will be of interest to us.
data <- read.csv("./data/stormData.csv")
dim(data)
## [1] 902297 37
head(data[, 1:7])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
We begin with the csv file that was created by running the previous step. We
convert the data to a data.table (using package data.table) for faster processing.
We then extract those observations that contain entries for fatalities, injuries,
crop damage, or property damage:
A few things we do at the beginning:
data.table (package data.table) for faster processingEVTYPE (event type) column of this data. The
previous step saved us from a lot of unnecessary cleanup by removing non-essential data,library(data.table)
DT <- data.table(data)
## A) read the downloaded data (takes a bit of time):
data <- read.csv("./data/stormData.csv")
## B) read it into data.table for faster processing:
DT <- data.table(data) #dim(DT) = 902297rows 37cols
## C) select only the following observations:
DT <- DT[DT$FATALITIES > 0 | DT$INJURIES > 0 | DT$PROPDMG > 0 | DT$CROPDMG >
0, ]
## D) further, select only the observations from the 50 states
DT <- DT[STATE %in% state.abb]
data <- file.path(getwd(), "SmallerDataFile.csv")
# read the smaller data (created with `reduce_dataset_and_save.R`) into
# dataframe
df <- read.csv("SmallerDataFile.csv", header = TRUE)
# convert it to a `data.table` for faster processing
dt <- data.table(df)
After the preprocessing step, the dataset has been reduced to 253,210 observations - a size a bit more manageable than that of the original dataset.
We next create the 5 different categories for the Event Types, and find the events
that fall into these 5 categories. Also, create intidactor variables for each
of the categories. For any event, its indicator variable takes on the value of
TRUE if that occurred and FALSE otherwise. For example FloodI indicator shows
whether the event in question was categorizes as a Flood event or not.
setnames(dt, names(dt), tolower(names(dt)))
# Category 1: `Convection` Lightning, Tornado, Thunderstorm, Wind, and Hail
# fall into this category
lightning <- "\\bL\\S+?G\\b"
tornado <- "(NADO)|(\\bTOR\\S+?O\\b|(\\bFUN))"
thunderstorm <- "THUNDERSTORM|TSTM"
wind <- "(WIND)|(WND)"
hail <- "HAIL"
regex <- paste(lightning, tornado, thunderstorm, wind, hail, sep = "|")
indicator <- grepl(regex, dt$evtype, ignore.case = TRUE)
dt <- dt[, `:=`(ConvectionI, indicator)]
# here is the list:
uniqueEventTypes <- unique(dt[indicator, evtype])
# show(uniqueEventTypes[order(uniqueEventTypes)])
# Category 2: `Extreme Temperatures` Cold and Heat fall into this category
regex <- "COLD|HEAT"
indicator <- grepl(regex, dt$evtype, ignore.case = TRUE)
dt <- dt[, `:=`(ExtremeTempI, indicator)]
# here is the list:
uniqueEventTypes <- unique(dt[indicator, evtype])
show(uniqueEventTypes[order(uniqueEventTypes)])
For example, the following events fall into the Extreme Temperatures category:
uniqueEventTypes <- unique(dt[indicator, evtype])
show(uniqueEventTypes[order(uniqueEventTypes)])
## [1] Cold COLD
## [3] COLD AIR TORNADO COLD AND SNOW
## [5] COLD AND WET CONDITIONS Cold Temperature
## [7] COLD WAVE COLD WEATHER
## [9] COLD/WIND CHILL COLD/WINDS
## [11] DROUGHT/EXCESSIVE HEAT EXCESSIVE HEAT
## [13] Extended Cold Extreme Cold
## [15] EXTREME COLD EXTREME COLD/WIND CHILL
## [17] EXTREME HEAT FOG AND COLD TEMPERATURES
## [19] HEAT Heat Wave
## [21] HEAT WAVE HEAT WAVE DROUGHT
## [23] HEAT WAVES HIGH WINDS/COLD
## [25] RECORD COLD RECORD HEAT
## [27] RECORD/EXCESSIVE HEAT SNOW/ BITTER COLD
## [29] SNOW/COLD Unseasonable Cold
## [31] UNSEASONABLY COLD
## 471 Levels: FLASH FLOOD TSTM WIND TSTM WIND (G45) ... WINTRY MIX
# Category 3: `Flood` Flod and Rain fall into this category
flood <- "(\\bFL\\S+?D)"
rain <- "RAIN|PRECIP|SHOWER"
regex <- paste(flood, rain, sep = "|")
indicator <- grepl(regex, dt$evtype, ignore.case = TRUE)
dt <- dt[, `:=`(FloodI, indicator)]
uniqueEventTypes <- unique(dt[indicator, evtype])
# show(uniqueEventTypes[order(uniqueEventTypes)])
# Category 4: `Winter` Snow, Ice, Freeze, or Winter Weather fall into this
# category
regex <- "(SNOW)|(ICE)|(ICY)|(FREEZ)|(WINT)"
indicator <- grepl(regex, dt$evtype, ignore.case = TRUE)
dt <- dt[, `:=`(WinterI, indicator)]
uniqueEventTypes <- unique(dt[indicator, evtype])
# show(uniqueEventTypes[order(uniqueEventTypes)])
# Category 5: `Other` Everything else that hasn't been categorized so far
# falls into this category
dt <- dt[, `:=`(OtherI, ConvectionI == FALSE & ExtremeTempI == FALSE & FloodI ==
FALSE & WinterI == FALSE)]
And here are the results cross-tabulated:
# Cross Tabulate the results
groupby <- expression(list(ConvectionI, ExtremeTempI, FloodI, WinterI, OtherI))
# head(dt)
dt[, .N, eval(groupby)][order(ConvectionI, ExtremeTempI, FloodI, WinterI, OtherI,
decreasing = TRUE)]
## ConvectionI ExtremeTempI FloodI WinterI OtherI N
## 1: TRUE TRUE FALSE FALSE FALSE 208
## 2: TRUE FALSE TRUE TRUE FALSE 1
## 3: TRUE FALSE TRUE FALSE FALSE 19
## 4: TRUE FALSE FALSE TRUE FALSE 11
## 5: TRUE FALSE FALSE FALSE FALSE 208626
## 6: FALSE TRUE FALSE TRUE FALSE 4
## 7: FALSE TRUE FALSE FALSE FALSE 1227
## 8: FALSE FALSE TRUE TRUE FALSE 96
## 9: FALSE FALSE TRUE FALSE FALSE 33038
## 10: FALSE FALSE FALSE TRUE FALSE 4845
## 11: FALSE FALSE FALSE FALSE TRUE 5135
## Now that we've set up the categories, an additional step is needed We note
## that several records can have multiple events listed in the `EVTYPE`
## variable.
dt <- dt[, `:=`(Category, ifelse(ConvectionI, 1, ifelse(ExtremeTempI, 2, ifelse(FloodI,
3, ifelse(WinterI, 4, ifelse(OtherI, 5, NA))))))]
CategoryName <- c("Convection", "Extreme Temperature", "Flood", "Winter", "Other")
dt <- dt[, `:=`(Category, factor(Category, labels = CategoryName))]
dt[, .N, Category]
## Category N
## 1: Convection 208865
## 2: Flood 33134
## 3: Winter 4845
## 4: Other 5135
## 5: Extreme Temperature 1231
This reveals that a disproportionate number of events fall under the Convection
category. Perhaps there were more observations taken for this category.
Looking at the dates more carefully, we see that:
dt$bgn_date <- as.Date(dt$bgn_date, format = "%m/%d/%Y")
# class(dt$bgn_date)
year <- year(min(dt$bgn_date[dt$Category == "Convection"]))
cat("Earliest observation for Event Convection: ", year)
## Earliest observation for Event Convection: 1950
year <- year(min(dt$bgn_date[dt$Category != "Convection"]))
cat("Earliest observation for Event that is NOT Convection: ", year)
## Earliest observation for Event that is NOT Convection: 1993
############ Therefore, we further subset the dataset and only consider the events
############ after 1993
min.year <- year(min(dt$bgn_date[dt$Category != "Convection"]))
max.year <- year(max(dt$bgn_date)) # this is 2011
dt <- dt[min.year <= year(bgn_date) & year(bgn_date) <= max.year]
# dim(dt)
Thus we will be considering data from 1993 to 2011. This further reduces the number of observations:
dim(dt)
## [1] 225836 43
The property and crop damages inflicted by the storms during the years of 1993-2011 were
as follows (B stands for billion, M- for million, K - for thousand, all in USD):
dt <- dt[, `:=`(propdmgexp, toupper(propdmgexp))]
dt[, .N, propdmgexp]
## propdmgexp N
## 1: 10076
## 2: B 39
## 3: K 206990
## 4: M 8478
## 5: + 5
## 6: 0 210
## 7: 5 18
## 8: 6 3
## 9: 4 4
## 10: H 7
## 11: 2 1
## 12: 7 3
## 13: 3 1
## 14: - 1
dt <- dt[, `:=`(property.damage, ifelse(propdmgexp == "B", propdmg * 1e+09,
ifelse(propdmgexp == "M", propdmg * 1e+06, ifelse(propdmgexp == "K", propdmg *
1000, ifelse(propdmgexp == "H", propdmg * 100, propdmg)))))]
summary(dt$property.damage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 2.00e+03 8.00e+03 1.74e+06 3.00e+04 1.15e+11
################## crop damages ##################
dt <- dt[, `:=`(cropdmgexp, toupper(cropdmgexp))]
dt[, .N, cropdmgexp]
## cropdmgexp N
## 1: 124392
## 2: M 1961
## 3: K 99453
## 4: B 7
## 5: ? 6
## 6: 0 17
dt <- dt[, `:=`(crop.damage, ifelse(cropdmgexp == "B", cropdmg * 1e+09, ifelse(cropdmgexp ==
"M", cropdmg * 1e+06, ifelse(cropdmgexp == "K", cropdmg * 1000, ifelse(cropdmgexp ==
"H", cropdmg * 100, cropdmg)))))]
summary(dt$crop.damage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 2.14e+05 0.00e+00 5.00e+09
#################################################### First, reshape the data
D <- dt
labels <- c("Convection", "Extreme Temperature", "Flood", "Winter", "Other")
D2 <- rbind(D[, list(state, year = year(bgn_date), Category = factor(Category,
labels = labels), outcome = "Fatalities (thousands)", value = fatalities/1000)],
D[, list(state, year = year(bgn_date), Category = factor(Category, labels = labels),
outcome = "Injuries (thousands)", value = injuries/1000)], D[, list(state,
year = year(bgn_date), Category = factor(Category, labels = labels),
outcome = "Property damage ($, billions)", value = propdmg/1e+09)])
### Tabulate by state and category, and by state. For the state tabulation,
### rank each state according to outcome.
tabulateStateCategory <- D2[, list(value = sum(value)), list(state, Category,
outcome)]
tabulateState <- D2[, list(value = sum(value)), list(state, outcome)]
tabulateState <- tabulateState[, `:=`(rank, abs(rank(value, ties.method = "random") -
51)), list(outcome)]
The fatalities resulting from extreme weather over this period are shown in the following table. For example, the state of Illinois suffered the most fatalities, 998, due to extreme temperatures:
top <- tabulateState[grepl("Fatal", outcome) & rank <= 1, state]
where <- expression(state %in% top & grepl("Fatal", outcome))
select <- expression(list(state, value = format(value * 1000, big.mark = ","),
Category))
tabulation <- tabulateStateCategory[eval(where), eval(select)]
tabulation <- tabulation[order(value, decreasing = TRUE)]
print(tabulation)
## state value Category
## 1: IL 998 Extreme Temperature
## 2: IL 181 Convection
## 3: IL 24 Flood
## 4: IL 20 Winter
## 5: IL 4 Other
The injuries resulting from extreme weather are also shown. For example, the state of Texas suffered 6,951 injuries as a result of floods over this period:
top <- tabulateState[grepl("Inj", outcome) & rank <= 1, state]
where <- expression(state %in% top & grepl("Inj", outcome))
select <- expression(list(state, value = format(value * 1000, big.mark = ","),
Category))
tabulation <- tabulateStateCategory[eval(where), eval(select)]
tabulation <- tabulation[order(value, decreasing = TRUE)]
print(tabulation)
## state value Category
## 1: TX 6,951 Flood
## 2: TX 1,997 Convection
## 3: TX 787 Extreme Temperature
## 4: TX 211 Winter
## 5: TX 186 Other
library(ggplot2)
ggplot(tabulateStateCategory, aes(x = state, y = value, fill = Category)) +
geom_bar(alpha = 1/2, stat = "identity") + scale_fill_brewer(name = "Category",
palette = 3) + scale_x_discrete(name = "") + scale_y_continuous(name = "") +
facet_wrap(~outcome, scales = "free", nrow = 3, ncol = 1) + theme(legend.position = "bottom")