Synopsis
This study explores the NOAA Storm database to answer two basic questions about severe weather events in the United States:
- Which types of events are most harmful with respect to population health?
- Which types of events have the greatest economic consequences?
Data Processing
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this study come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file through this link:
Storm Data [47Mb]
In R, we can download, extract and read the data files by running the following script:
require(readr) # read csv in tibble format
URL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# download the compressed file only if it is not already on disk
if (!file.exists('stormData.bz2')) {
  download.file(URL, destfile = 'stormData.bz2', method = 'curl')
}
# read through a bzip2 connection, which decompresses on the fly;
# reading outside the if block means stormData is created on every run
stormData <- read_csv(bzfile('stormData.bz2'))
Our raw dataset has 902297 observations of 37 variables, detailed on this link. In our analysis, however, we will use only the following variables:
- STATE: State abbreviation;
- DATE: Date in m/d/YYYY format;
- EVTYPE: Event Type (Tornado, Tsunami, Hurricane, etc);
- FATALITIES: Count of fatal victims;
- INJURIES: Count of injured victims;
- PROPDMG: Property Damage;
- CROPDMG: Crop Damage;
Our weather event variable (EVTYPE) is not ready for analysis: it contains duplicate events, invalid events, and inconsistent formats. Here are the first distinct values of the EVTYPE variable:
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
## [21] "FLASH FLOODING" "HIGH WINDS"
## [23] "FUNNEL CLOUD" "TORNADO F0"
## [25] "THUNDERSTORM WINDS LIGHTNING"
Thus, we cannot proceed with our analysis without first cleaning and preprocessing this variable. We will use text mining techniques in R to collapse as many redundant values as possible and make the variable easier to analyze.
require(dplyr) # %>% and mutate
require(stringr) # str_replace_all and str_to_title
require(tm) # Text Mining - Useful transformations
require(textstem) # Text Mining - Lemmatization
stormData <- stormData %>%
  mutate(EVTYPE = tolower(EVTYPE), # lower case text
         EVTYPE = str_replace_all(EVTYPE, '/.*', ''), # remove secondary tags
         EVTYPE = removeNumbers(EVTYPE), # remove all numbers
         EVTYPE = removePunctuation(EVTYPE), # remove all punctuation
         EVTYPE = removeWords(EVTYPE, stopwords('en')), # remove stopwords
         EVTYPE = stripWhitespace(EVTYPE), # collapse repeated whitespace
         EVTYPE = lemmatize_strings(EVTYPE), # EVTYPE lemmatization
         # now let's replace some words (analyzed manually)
         EVTYPE = str_replace_all(EVTYPE, "tstm", "thunderstorm"),
         EVTYPE = str_replace_all(EVTYPE, "windd|win$|wi$", "wind"),
         EVTYPE = str_replace_all(EVTYPE, "\\s(f|g)$", ""),
         EVTYPE = str_to_title(EVTYPE)) # "Title" string type format
Let’s see how the transformed weather event variable looks now:
## [1] "Tornado" "Thunderstorm Wind"
## [3] "Hail" "Freeze Rain"
## [5] "Snow" "Ice Storm"
## [7] "Winter Storm" "Hurricane Opal"
## [9] "Record Cold" "Hurricane Erin"
## [11] "Heavy Rain" "Lightning"
## [13] "Dense Fog" "Rip Current"
## [15] "Flash Flood" "High Wind"
## [17] "Funnel Cloud" "Thunderstorm Wind Lightning"
## [19] "Heat" "Wind"
## [21] "Light" "Lightning Heavy Rain"
## [23] "Funnel" "Wall Cloud"
## [25] "Flood"
Much better, right?
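As a rough check of what these steps do, the sketch below applies a simplified, base-R-only version of the cleaning to six raw labels taken from the first listing. The function name `clean_evtype` is hypothetical, and the two suffix rules are only a crude stand-in for the lemmatization that `textstem` performs; treat it as an approximation, not the code used for the analysis.

```r
# Simplified stand-in for the tm/textstem pipeline, using base R only.
# clean_evtype is a hypothetical helper name, not part of any package.
clean_evtype <- function(x) {
  x <- tolower(x)                        # lower case text
  x <- sub("/.*", "", x)                 # drop secondary tags after a slash
  x <- gsub("[0-9]", "", x)              # remove digits
  x <- gsub("[[:punct:]]", "", x)        # remove punctuation
  x <- gsub("tstm", "thunderstorm", x)   # expand the abbreviation
  x <- gsub("(winds|wins)$", "wind", x)  # crude substitute for lemmatization
  x <- gsub("flooding$", "flood", x)     # crude substitute for lemmatization
  x <- gsub(" (f|g)$", "", x)            # drop a trailing single-letter tag
  x <- trimws(gsub(" +", " ", x))        # collapse repeated spaces
  gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)  # capitalize each word
}

raw <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WINS",
         "ICE STORM/FLASH FLOOD", "TORNADO F0", "FLASH FLOODING")
cleaned <- clean_evtype(raw)

# Compare the number of distinct labels before and after cleaning.
length(unique(raw))
length(unique(cleaned))
```

On this small sample the six raw labels collapse into four, because the three thunderstorm wind spellings become a single value.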
To continue our analysis, we will summarize our data set by grouping on two variables: YEAR, created from DATE, and the type of weather event (EVTYPE). For each combination of these two variables, we will calculate the sum of injuries (INJURIES), fatalities (FATALITIES), property damage (PROPDMG) and crop damage (CROPDMG).
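The summarizing code itself is not shown in this post, so the following is only a minimal sketch of the grouping step on made-up records. It uses base R's `aggregate` so that it runs without extra packages; the `toy` data frame and all of its values are assumptions for illustration, and a `dplyr` `group_by` plus `summarise` pipeline would be an equivalent route.

```r
# Toy records standing in for the cleaned storm data (all values are made up).
toy <- data.frame(
  DATE       = c("4/18/1950", "6/8/1950", "2/20/1951", "7/4/1951"),
  EVTYPE     = c("Tornado", "Tornado", "Tornado", "Hail"),
  INJURIES   = c(15, 2, 0, 1),
  FATALITIES = c(1, 0, 0, 0),
  PROPDMG    = c(25, 2.5, 250, 0),
  CROPDMG    = c(0, 0, 0, 5),
  stringsAsFactors = FALSE
)

# Derive YEAR from the m/d/YYYY date string.
toy$YEAR <- as.integer(format(as.Date(toy$DATE, format = "%m/%d/%Y"), "%Y"))

# Sum each outcome within every YEAR x EVTYPE combination.
yearly <- aggregate(
  cbind(INJURIES, FATALITIES, PROPDMG, CROPDMG) ~ YEAR + EVTYPE,
  data = toy, FUN = sum
)
yearly
```

The result has one row per year and event type, which is the shape the plots in the Results section rely on.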
The first rows of our final dataset look like this:
YEAR | EVTYPE | INJURIES | FATALITIES | PROPDMG | CROPDMG |
---|---|---|---|---|---|
1950 | Tornado | 659 | 70 | 16999.15 | 0 |
1951 | Tornado | 524 | 34 | 10560.99 | 0 |
1952 | Tornado | 1915 | 230 | 16679.74 | 0 |
1953 | Tornado | 5131 | 519 | 19182.20 | 0 |
1954 | Tornado | 715 | 36 | 23367.82 | 0 |
1955 | Hail | 0 | 0 | 0.00 | 0 |
Results
After having the data set ready, we will graphically analyze the behavior of some weather events. We will choose the 10 events that have the highest average of the variable in question.
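The plotting code in this section selects events with fixed thresholds on the mean (for example `mean(INJURIES) > 80`). As an alternative illustration, the sketch below ranks event types by their mean and keeps the top n directly. The helper name `top_events` and the toy values are hypothetical, and this is not the selection method used for the figures.

```r
# Hypothetical helper: names of the n event types with the highest mean of `value`.
top_events <- function(df, value, n = 10) {
  # mean of `value` per event type, as a named numeric vector
  means <- vapply(split(df[[value]], df$EVTYPE), mean, numeric(1))
  # names ordered from highest to lowest mean, truncated to n
  head(names(sort(means, decreasing = TRUE)), n)
}

# Made-up yearly totals for three event types.
toy <- data.frame(
  YEAR     = c(1996, 1997, 1996, 1997, 1996),
  EVTYPE   = c("Tornado", "Tornado", "Hail", "Hail", "Heat"),
  INJURIES = c(10, 20, 1, 3, 100),
  stringsAsFactors = FALSE
)

top_two <- top_events(toy, "INJURIES", n = 2)
top_two
```

Ranking this way avoids tuning a separate cutoff for each variable, at the cost of always returning n events even when some of them have very small means.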
First, we will examine injuries and deaths annually starting in 1996, because earlier years have few records.
require(ggplot2) # grammar of graphics
require(scales) # log10 transformation
require(RColorBrewer) # color palette
require(egg) # par(mfrow = ...) equivalent
require(ggpubr) # common title, ggarrange
# event types to highlight in colour: mean injuries above 80
# (mean() runs over every year in the group, not only those after 1995)
stormDataFiltered <- stormData %>%
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(INJURIES) > 80)
INJ <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, INJURIES, colour = EVTYPE),
             data = stormDataFiltered, size = 1) +
  # every event type as a grey background line
  geom_line(aes(YEAR, INJURIES, group = EVTYPE),
            colour = alpha("grey", 0.7), size = 0.7) +
  # highlighted event types drawn on top
  geom_line(aes(YEAR, INJURIES, colour = EVTYPE),
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Injuries') +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() + theme(legend.position = "bottom",
                     legend.spacing.x = unit(0, 'cm')) +
  guides(color = guide_legend(nrow = 3, byrow = TRUE))
# same selection for fatalities, with a threshold of 14.8
stormDataFiltered <- stormData %>%
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(FATALITIES) > 14.8)
FAT <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, FATALITIES, colour = EVTYPE),
             data = stormDataFiltered, size = 1) +
  geom_line(aes(YEAR, FATALITIES, group = EVTYPE),
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, FATALITIES, colour = EVTYPE),
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Fatalities') +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  theme_bw() + theme(legend.position = "bottom",
                     legend.spacing.x = unit(0, 'cm')) +
  guides(color = guide_legend(nrow = 3, byrow = TRUE))
# place both panels side by side and add a shared title and source note
figure <- ggarrange(INJ, FAT, nrow = 1)
annotate_figure(figure,
                top = text_grob('Top 10 Injuries and Fatalities per Year caused by Weather Events'),
                bottom = text_grob("Data source: \n NOAA Storm database", color = "blue",
                                   hjust = 1, x = 1, face = "italic", size = 10))
We can infer from the graph analysis:
- Tornado injuries and fatalities stayed close to their average over the years studied. In 2011 (the last year recorded), however, both rose atypically.
- Injuries from ice storms have declined since 1996, as have those from extreme cold. As a consequence, their fatalities do not reach the top 10.
- Weather events associated with heat show high variability in the injury count and, over the years studied, an upward trend in fatalities. This rise in heat may contribute to the decline in problems associated with ice, flood, and cold.
Now, we will examine property and crop damage annually starting in 1996, because earlier years have few records.
stormDataFiltered <- stormData %>%
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(PROPDMG) > 5000)
PROP <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, PROPDMG, colour = EVTYPE),
             data = stormDataFiltered, size = 1) +
  geom_line(aes(YEAR, PROPDMG, group = EVTYPE),
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, PROPDMG, colour = EVTYPE),
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Property Damage Amount') +
  #scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() + theme(legend.position = "bottom",
                     legend.spacing.x = unit(0, 'cm')) +
  guides(color = guide_legend(nrow = 3, byrow = TRUE))
stormDataFiltered <- stormData %>%
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(CROPDMG) > 600)
CROP <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, CROPDMG, colour = EVTYPE),
             data = stormDataFiltered, size = 1) +
  geom_line(aes(YEAR, CROPDMG, group = EVTYPE),
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, CROPDMG, colour = EVTYPE),
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Crop Damage Amount') +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  theme_bw() + theme(legend.position = "bottom",
                     legend.spacing.x = unit(0, 'cm')) +
  guides(color = guide_legend(nrow = 3, byrow = TRUE))
figure <- ggarrange(PROP, CROP, nrow = 1)
annotate_figure(figure,
                top = text_grob('Top 10 Property and Crop Damage Amount per Year caused by Weather Events'),
                bottom = text_grob("Data source: \n NOAA Storm database", color = "blue",
                                   hjust = 1, x = 1, face = "italic", size = 10))
We can deduce from the graphical analysis:
- Except for wildfire, all events in the top 10 for property damage stay at a roughly constant level each year with low variability, indicating some control. Wildfire's high variability and strong upward trend may be due to the high heat observed in the previous graph.
- Regarding crop damage, most events appear to be under some control: they show low variability around a constant average over the years. The river flood event is the exception, with high variability and apparently out of control.
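The statements about low and high variability above are read off the graphs. One way to put a number on them, sketched here under the assumption that yearly totals per event type are available, is the coefficient of variation (standard deviation divided by mean). The `toy` data frame, its event names, and its values are invented for illustration and are not results from the storm data.

```r
# Made-up yearly property damage totals for two hypothetical event types.
toy <- data.frame(
  EVTYPE  = rep(c("Steady", "Erratic"), each = 4),
  YEAR    = rep(1996:1999, times = 2),
  PROPDMG = c(100, 110, 90, 100, 10, 500, 5, 285),
  stringsAsFactors = FALSE
)

# Coefficient of variation: spread relative to the average level.
cv <- function(x) sd(x) / mean(x)

# One value per event type; larger values mean more year-to-year variability.
cv_by_event <- vapply(split(toy$PROPDMG, toy$EVTYPE), cv, numeric(1))
round(cv_by_event, 2)
```

An event type whose totals hover around a constant average gets a value near zero, while one that swings widely from year to year gets a value near or above one.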
Until next time,
Daniel Amaral