Synopsis

The objective of this study is to explore the NOAA Storm database and answer two basic questions about severe weather events in the United States:

Which types of events are most harmful with respect to population health?
Which types of events have the greatest economic consequences?

Data Processing

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data for this study come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file through this link:

Storm Data [47Mb]

In R, we can download, extract and read the data files by running the following script:

require(readr) # read csv in tibble format

URL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# download the archive only if it is not already cached locally
if (!file.exists('stormData.bz2')) {
  download.file(URL, destfile = 'stormData.bz2', method = 'curl')
}
# read through a bzfile() connection: the data are decompressed on the fly,
# and stormData is defined even when the download step is skipped
stormData <- read_csv(bzfile('stormData.bz2'))
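
As a quick check that no manual extraction step is needed, the sketch below round-trips a tiny bzip2-compressed CSV using base R only. The rows and the temporary path are made up for illustration, and `read.csv()` stands in for `read_csv()` here.

```r
# Toy round trip: write a small bzip2-compressed CSV, then read it back.
# The rows below are illustrative, not taken from the NOAA file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("STATE,EVTYPE,FATALITIES",
             "AL,TORNADO,0",
             "TX,HAIL,0"), con)
close(con)

# base R file readers detect the compression when opening the file
toy <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy)
unlink(tmp) # remove the temporary file
```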

Our raw dataset has 902,297 observations of 37 variables, detailed on this link. In our analysis, however, we will use only the following variables:

STATE: State abbreviation;
DATE: Event start date, parsed from BGN_DATE (m/d/YYYY format);
EVTYPE: Event type (Tornado, Tsunami, Hurricane, etc.);
FATALITIES: Count of fatal victims;
INJURIES: Count of injured victims;
PROPDMG: Property damage amount;
CROPDMG: Crop damage amount.

require(magrittr) # %>% operator
require(lubridate) # date manipulation
require(dplyr) # data manipulation

stormData <- stormData %>%
  mutate(DATE = mdy_hms(BGN_DATE)) %>% # convert to datetime format
  select(DATE, STATE, EVTYPE, FATALITIES, INJURIES, 
         PROPDMG, CROPDMG)
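
To make the date conversion concrete, here is a minimal base R sketch of the same parsing step. It assumes BGN_DATE strings follow a month/day/year hour:minute:second layout; the two sample strings are illustrative, and `as.POSIXct()` is used as a base R stand-in for `mdy_hms()`.

```r
# Parse sample date strings (assumed layout: m/d/YYYY H:M:S) and
# extract the calendar year from each.
raw_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")
parsed <- as.POSIXct(raw_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
years <- as.integer(format(parsed, "%Y"))
print(parsed)
print(years)
```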

Our weather event variable (EVTYPE) is not ready for analysis: it contains duplicate events, invalid events, and inconsistent formats. Here are the first distinct values of EVTYPE:

stormData %>%
  pull(EVTYPE) %>%
  unique() %>% head(25)

##  [1] "TORNADO"                      "TSTM WIND"                   
##  [3] "HAIL"                         "FREEZING RAIN"               
##  [5] "SNOW"                         "ICE STORM/FLASH FLOOD"       
##  [7] "SNOW/ICE"                     "WINTER STORM"                
##  [9] "HURRICANE OPAL/HIGH WINDS"    "THUNDERSTORM WINDS"          
## [11] "RECORD COLD"                  "HURRICANE ERIN"              
## [13] "HURRICANE OPAL"               "HEAVY RAIN"                  
## [15] "LIGHTNING"                    "THUNDERSTORM WIND"           
## [17] "DENSE FOG"                    "RIP CURRENT"                 
## [19] "THUNDERSTORM WINS"            "FLASH FLOOD"                 
## [21] "FLASH FLOODING"               "HIGH WINDS"                  
## [23] "FUNNEL CLOUD"                 "TORNADO F0"                  
## [25] "THUNDERSTORM WINDS LIGHTNING"

Thus, we cannot proceed with our analysis without first cleaning and preprocessing this variable. We will use text mining techniques in R to collapse as many duplicate levels as possible and make the variable easier to analyze.

require(tm) # Text Mining - Useful transformations
require(textstem) # Text Mining - Lemmatization
require(stringr) # str_replace_all() and str_to_title()

stormData <- stormData %>%
  mutate(EVTYPE = tolower(EVTYPE), # lower case text
         EVTYPE = str_replace_all(EVTYPE, '/.*', ''), # remove secondary tags
         EVTYPE = removeNumbers(EVTYPE), # remove all numbers
         EVTYPE = removePunctuation(EVTYPE), # remove all punctuation
         EVTYPE = removeWords(EVTYPE, stopwords('en')), # remove stopwords
         EVTYPE = stripWhitespace(EVTYPE), # collapse repeated whitespace
         EVTYPE = lemmatize_strings(EVTYPE), # EVTYPE lemmatization
         # now let's replace some words (analyzed manually)
         EVTYPE = str_replace_all(EVTYPE, "tstm", "thunderstorm"),
         EVTYPE = str_replace_all(EVTYPE, "windd|win$|wi$", "wind"),
         EVTYPE = str_replace_all(EVTYPE, "\\s(f|g)$", ""),
         EVTYPE = str_to_title(EVTYPE)) # "Title" string type format

Let’s see how the transformed weather event variable looks now:

stormData %>%
  pull(EVTYPE) %>%
  unique() %>% head(25)

##  [1] "Tornado"                     "Thunderstorm Wind"          
##  [3] "Hail"                        "Freeze Rain"                
##  [5] "Snow"                        "Ice Storm"                  
##  [7] "Winter Storm"                "Hurricane Opal"             
##  [9] "Record Cold"                 "Hurricane Erin"             
## [11] "Heavy Rain"                  "Lightning"                  
## [13] "Dense Fog"                   "Rip Current"                
## [15] "Flash Flood"                 "High Wind"                  
## [17] "Funnel Cloud"                "Thunderstorm Wind Lightning"
## [19] "Heat"                        "Wind"                       
## [21] "Light"                       "Lightning Heavy Rain"       
## [23] "Funnel"                      "Wall Cloud"                 
## [25] "Flood"

Much better, right?
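
To see why several raw labels collapse into one, the sketch below replays the main cleaning steps on a handful of labels from the first listing, using base R only. The helper `clean_evtype()` is hypothetical, and stripping a trailing "s" is only a crude stand-in for the lemmatization step, not what `lemmatize_strings()` actually does.

```r
# Hypothetical helper: a simplified, base R version of the cleaning steps.
clean_evtype <- function(x) {
  x <- tolower(x)
  x <- sub("/.*", "", x)                    # drop secondary tags
  x <- gsub("[0-9]+", "", x)                # remove numbers
  x <- gsub("[[:punct:]]+", "", x)          # remove punctuation
  x <- gsub("\\s+", " ", trimws(x))         # collapse whitespace
  x <- sub("s$", "", x)                     # crude plural strip (assumption)
  x <- gsub("tstm", "thunderstorm", x)
  x <- gsub("windd|win$|wi$", "wind", x)
  x <- gsub("\\s(f|g)$", "", x)             # drop trailing scale letters
  gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE) # title case
}

raw <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WINS",
         "TORNADO F0", "ICE STORM/FLASH FLOOD")
cleaned <- clean_evtype(raw)
print(unique(cleaned))
```

On these five labels, the three thunderstorm wind spellings end up as a single value.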

To continue our analysis, we will summarize our data set by grouping on two variables: YEAR, created from DATE, and the type of weather event (EVTYPE). For each combination of these two variables, we will calculate the sum of injuries (INJURIES), fatalities (FATALITIES), property damage (PROPDMG) and crop damage (CROPDMG).

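stormData <- stormData %>%
  mutate(YEAR = year(DATE)) %>%
  group_by(YEAR, EVTYPE) %>%
  # grouping columns are kept automatically, so only the sums are listed;
  # one row per (YEAR, EVTYPE) pair, so no distinct() is needed
  summarise(INJURIES = sum(INJURIES),
            FATALITIES = sum(FATALITIES),
            PROPDMG = sum(PROPDMG),
            CROPDMG = sum(CROPDMG),
            .groups = "drop")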
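
As an illustration of what this summary produces, here is a small base R sketch of the same grouped sum on made-up rows. The toy values are invented, and `aggregate()` is used only as a base R counterpart to the dplyr pipeline.

```r
# Toy data: four event records across two years (invented values).
toy <- data.frame(
  YEAR       = c(1996, 1996, 1996, 1997),
  EVTYPE     = c("Tornado", "Tornado", "Hail", "Tornado"),
  INJURIES   = c(3, 2, 0, 5),
  FATALITIES = c(1, 0, 0, 2)
)

# One row per (YEAR, EVTYPE) pair, with each measure summed.
totals <- aggregate(cbind(INJURIES, FATALITIES) ~ YEAR + EVTYPE,
                    data = toy, FUN = sum)
print(totals)
```

The two 1996 tornado records merge into a single row, which is the shape the plots in the Results section rely on.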

The head of our final dataset looks like this:

require(knitr) # table styles in rmarkdown

kable(head(stormData))

YEAR	EVTYPE	INJURIES	FATALITIES	PROPDMG
1950	Tornado	659	70	16999.15
1951	Tornado	524	34	10560.99
1952	Tornado	1915	230	16679.74
1953	Tornado	5131	519	19182.20
1954	Tornado	715	36	23367.82
1955	Hail	0	0	0.00

Results

With the data set ready, we will graphically analyze the behavior of some weather events. For each variable, we will highlight the 10 events with the highest average.

First, we will examine injuries and deaths annually starting in 1996, since earlier years have few records.
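
The plotting code in this section picks the highlighted events with hard-coded thresholds such as `mean(INJURIES) > 80`. As an alternative sketch, the ranking can be computed directly; `top_n_events()` is a hypothetical helper and the toy values are invented.

```r
# Hypothetical helper: names of the n event types with the highest mean
# of a given column.
top_n_events <- function(df, value, n = 10) {
  means <- tapply(df[[value]], df$EVTYPE, mean)
  ranked <- names(sort(means, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

toy <- data.frame(
  EVTYPE   = c("Tornado", "Tornado", "Heat", "Heat", "Hail", "Hail"),
  INJURIES = c(900, 1100, 300, 500, 10, 30)
)
top2 <- top_n_events(toy, "INJURIES", n = 2)
print(top2)
```

Computing the ranking this way avoids retuning a threshold whenever the data or the year range changes.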

require(ggplot2) # grammar of graphics
require(scales) # log10 transformation
require(RColorBrewer) # color palette
require(egg) # par(mfrow = ...) equivalent
require(ggpubr) # common title for ggarrange

stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(INJURIES) > 80) 
INJ <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, INJURIES, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, INJURIES, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, INJURIES, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Injuries') +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm'))  + 
  guides(color = guide_legend(nrow = 3, byrow = T))

stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(FATALITIES) > 14.8) 
FAT <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, FATALITIES, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, FATALITIES, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, FATALITIES, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Fatalities') +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm')) + 
  guides(color = guide_legend(nrow = 3, byrow = T))

figure <- ggarrange(INJ, FAT, nrow = 1)
annotate_figure(figure,
                top = text_grob('Top 10 Injuries and Fatalities per Year caused by Weather Events'),
                bottom = text_grob("Data source: \n NOAA Storm database", color = "blue", 
                                   hjust = 1, x = 1, face = "italic", size = 10))

We can infer from the graph analysis:

Tornadoes stayed close to a stable average over the years studied. In 2011, the last year of measurement, however, injuries and fatalities rose atypically.

Injuries from ice storms have declined since 1996, as have injuries from extreme cold. Consistent with this, fatalities from these events do not rank in the top 10.

Heat-related events show high variability in injury counts and, over the years studied, an upward trend in fatalities. This rise in heat may contribute to the decline in problems associated with ice, flood, and cold.
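
One simple way to put a number on "atypical" is to compare a year against the mean and standard deviation of the preceding years. The sketch below does this with invented counts, not values from the NOAA data, and the cutoff of 3 is an arbitrary assumption.

```r
# Invented yearly injury counts for one event type.
previous_years <- c(1200, 1400, 1000, 1300, 1100)
latest_year <- 6000

# Distance of the latest year from the earlier average, in units of the
# earlier standard deviation.
z <- (latest_year - mean(previous_years)) / sd(previous_years)
is_atypical <- abs(z) > 3 # arbitrary cutoff, chosen for illustration
print(round(z, 1))
print(is_atypical)
```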

Now, we will examine property and crop damage annually starting in 1996, since in previous years we had few records.

stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(PROPDMG) > 5000) 
PROP <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, PROPDMG, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, PROPDMG, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, PROPDMG, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Property Damage Amount') +
  #scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm'))  + 
  guides(color = guide_legend(nrow = 3, byrow = T))

stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(CROPDMG) > 600) 
CROP <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, CROPDMG, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, CROPDMG, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, CROPDMG, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Crop Damage Amount') +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm')) + 
  guides(color = guide_legend(nrow = 3, byrow = T))

figure <- ggarrange(PROP, CROP, nrow = 1)
annotate_figure(figure,
                top = text_grob('Top 10 Property and Crop Damage Amount per Year caused by Weather Events'),
                bottom = text_grob("Data source: \n NOAA Storm database", color = "blue", 
                                   hjust = 1, x = 1, face = "italic", size = 10))

We can deduce from the graphical analysis:

Except for wildfire, every event in the top 10 for property damage stays at a roughly constant level each year with low variability, which suggests a stable process. Wildfire, by contrast, shows high variability and a strong upward trend, possibly related to the high heat observed in the previous graph.

For crop damage, most events also appear stable, with low variability around a constant average over the years. The exception is river flood, which shows high variability and no apparent stable level.
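
The notion of "low variability around a constant average" can be checked with the coefficient of variation (standard deviation divided by mean) per event type. This is a minimal sketch on invented values; the event names "Steady" and "Erratic" are placeholders.

```r
# Invented yearly damage amounts for two placeholder event types.
toy <- data.frame(
  EVTYPE  = rep(c("Steady", "Erratic"), each = 3),
  CROPDMG = c(100, 110, 90, 10, 200, 90)
)

# Coefficient of variation per event type: higher means less stable.
cv <- tapply(toy$CROPDMG, toy$EVTYPE, function(v) sd(v) / mean(v))
print(round(cv, 2))
```

Both toy events have the same mean, so the gap in the coefficient of variation comes only from the year-to-year spread.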

Until next time,

Daniel Amaral