Synopsis

This study explores the NOAA Storm Database to answer two basic questions about severe weather events in the United States:

  • Which types of events are most harmful with respect to population health?
  • Which types of events have the greatest economic consequences?

Data Processing

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data for this study come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file through this link:

Storm Data [47 MB]

In R, a single script can download, decompress, and read the data file.

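The original script is not reproduced here. As a minimal sketch of the reading step in base R: `read.csv` detects bzip2 compression on its own, so no separate extraction is needed. The example builds a tiny stand-in file rather than downloading the real one; the toy rows and the names `bz_path` and `stormRaw` are hypothetical.

```r
# Build a tiny bzip2-compressed CSV as a stand-in for the real download
# (hypothetical toy rows; the real file has 902297 rows).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# For the real data, a download.file() call with mode = "wb" and the link
# given above would replace the stand-in step.

# read.csv detects the compression and reads the file directly.
stormRaw <- read.csv(bz_path, stringsAsFactors = FALSE)
print(dim(stormRaw))
```

Reading the compressed file directly avoids keeping an uncompressed copy of the data on disk.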

The raw dataset has 902,297 observations of 37 variables (detailed on this link). In our analysis, we use only the following variables:

  • STATE: State abbreviation;
  • DATE: Date in m/d/YYYY format;
  • EVTYPE: Event Type (Tornado, Tsunami, Hurricane, etc);
  • FATALITIES: Number of fatalities;
  • INJURIES: Number of injuries;
  • PROPDMG: Property Damage;
  • CROPDMG: Crop Damage.

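One way to keep only these variables, sketched in base R. The raw column name `BGN_DATE`, the placeholder column `REMARKS`, and the toy rows are assumptions; the source does not show how DATE was derived.

```r
# Toy stand-in for the raw data (hypothetical rows). BGN_DATE is an assumed
# name for the raw date column; REMARKS stands in for the unused columns.
stormRaw <- data.frame(
  STATE = c("AL", "TX"),
  BGN_DATE = c("4/18/1950 0:00:00", "5/3/1999 0:00:00"),
  EVTYPE = c("TORNADO", "HAIL"),
  FATALITIES = c(0, 1), INJURIES = c(15, 0),
  PROPDMG = c(25, 2.5), CROPDMG = c(0, 0),
  REMARKS = c("", ""),
  stringsAsFactors = FALSE
)

# Keep the seven variables of interest and rename the date column.
keep <- c("STATE", "BGN_DATE", "EVTYPE", "FATALITIES",
          "INJURIES", "PROPDMG", "CROPDMG")
stormData <- stormRaw[, keep]
names(stormData)[names(stormData) == "BGN_DATE"] <- "DATE"

# Drop the time part and parse the m/d/YYYY date.
stormData$DATE <- as.Date(sub(" .*$", "", stormData$DATE), format = "%m/%d/%Y")
str(stormData)
```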

The weather event variable (EVTYPE) is not ready for analysis: it contains duplicate events, invalid events, and inconsistent formats. Here are the first 25 distinct values of EVTYPE:


##  [1] "TORNADO"                      "TSTM WIND"                   
##  [3] "HAIL"                         "FREEZING RAIN"               
##  [5] "SNOW"                         "ICE STORM/FLASH FLOOD"       
##  [7] "SNOW/ICE"                     "WINTER STORM"                
##  [9] "HURRICANE OPAL/HIGH WINDS"    "THUNDERSTORM WINDS"          
## [11] "RECORD COLD"                  "HURRICANE ERIN"              
## [13] "HURRICANE OPAL"               "HEAVY RAIN"                  
## [15] "LIGHTNING"                    "THUNDERSTORM WIND"           
## [17] "DENSE FOG"                    "RIP CURRENT"                 
## [19] "THUNDERSTORM WINS"            "FLASH FLOOD"                 
## [21] "FLASH FLOODING"               "HIGH WINDS"                  
## [23] "FUNNEL CLOUD"                 "TORNADO F0"                  
## [25] "THUNDERSTORM WINDS LIGHTNING"

Thus, we cannot proceed with our analysis without first cleaning and preprocessing this variable. We will use text mining techniques in R to merge as many redundant factor levels as possible and make the variable easier to analyze.

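The cleaning code itself is not shown. A minimal sketch of the idea using base R regular expressions follows; the rule list and the function name `clean_evtype` are illustrative assumptions, not the author's method, and will not reproduce the transformed values exactly.

```r
# Normalise event labels with a few hand-written rules (illustrative only).
clean_evtype <- function(x) {
  x <- toupper(trimws(x))
  # Digits, slashes and punctuation become spaces.
  x <- gsub("[^A-Z ]", " ", x, perl = TRUE)
  # Expand a common abbreviation and fix a known typo.
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
  x <- gsub("\\bWINS\\b", "WIND", x, perl = TRUE)
  # Fold plural and -ING variants into the base word.
  x <- gsub("\\b(WIND|FLOOD)(S|ING)\\b", "\\1", x, perl = TRUE)
  # Collapse repeated spaces.
  x <- gsub(" +", " ", trimws(x), perl = TRUE)
  # Title case: lower everything, then capitalise each word's first letter.
  gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)
}

raw <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WINS",
         "THUNDERSTORM WIND", "FLASH FLOODING", "FLASH FLOOD", "HIGH WINDS")
cleaned <- unique(clean_evtype(raw))
print(cleaned)
```

Seven raw labels collapse to three here. On the full dataset the rule list would need to grow with the variants actually present.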

Let’s see how the transformed weather event variable looks now:


##  [1] "Tornado"                     "Thunderstorm Wind"          
##  [3] "Hail"                        "Freeze Rain"                
##  [5] "Snow"                        "Ice Storm"                  
##  [7] "Winter Storm"                "Hurricane Opal"             
##  [9] "Record Cold"                 "Hurricane Erin"             
## [11] "Heavy Rain"                  "Lightning"                  
## [13] "Dense Fog"                   "Rip Current"                
## [15] "Flash Flood"                 "High Wind"                  
## [17] "Funnel Cloud"                "Thunderstorm Wind Lightning"
## [19] "Heat"                        "Wind"                       
## [21] "Light"                       "Lightning Heavy Rain"       
## [23] "Funnel"                      "Wall Cloud"                 
## [25] "Flood"

Much better, right?

To continue our analysis, we summarize the data set by grouping on two variables: YEAR, derived from DATE, and the type of weather event (EVTYPE). For each combination of these two variables, we calculate the sum of injuries (INJURIES), fatalities (FATALITIES), property damage (PROPDMG), and crop damage (CROPDMG).

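The summarization step can be sketched in base R with `aggregate`. The toy rows and the name `stormSummary` are hypothetical; the same grouping could also be written with dplyr's `group_by` and `summarise`.

```r
# Toy rows (hypothetical values) with the variables kept earlier.
stormData <- data.frame(
  DATE = as.Date(c("1950-04-18", "1950-06-08", "1951-01-02", "1951-03-10")),
  EVTYPE = c("Tornado", "Tornado", "Tornado", "Hail"),
  INJURIES = c(15, 2, 0, 1), FATALITIES = c(0, 1, 0, 0),
  PROPDMG = c(25, 2.5, 10, 0), CROPDMG = c(0, 0, 0, 5)
)

# Derive YEAR from DATE.
stormData$YEAR <- as.integer(format(stormData$DATE, "%Y"))

# Sum the four measures for every YEAR x EVTYPE combination.
stormSummary <- aggregate(
  cbind(INJURIES, FATALITIES, PROPDMG, CROPDMG) ~ YEAR + EVTYPE,
  data = stormData, FUN = sum
)
print(stormSummary)
```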

The head of our final dataset looks like this:


YEAR  EVTYPE   INJURIES  FATALITIES   PROPDMG  CROPDMG
1950  Tornado       659          70  16999.15        0
1951  Tornado       524          34  10560.99        0
1952  Tornado      1915         230  16679.74        0
1953  Tornado      5131         519  19182.20        0
1954  Tornado       715          36  23367.82        0
1955  Hail            0           0      0.00        0

Results

With the data set ready, we analyze the behavior of selected weather events graphically. For each variable, we choose the 10 events with the highest average.
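The plotting code in this section selects events with fixed cutoffs on the mean. As an alternative sketch, the top events can be picked by rank directly; the helper name `top_events` and the toy values are hypothetical.

```r
# Return the n event types with the highest mean of `measure`.
top_events <- function(data, measure, n = 10) {
  means <- tapply(data[[measure]], data$EVTYPE, mean)
  names(sort(means, decreasing = TRUE))[seq_len(min(n, length(means)))]
}

# Toy yearly summary (hypothetical values).
stormData <- data.frame(
  YEAR = rep(1996:1997, times = 3),
  EVTYPE = rep(c("Tornado", "Heat", "Hail"), each = 2),
  INJURIES = c(700, 900, 300, 100, 10, 20)
)

top2 <- top_events(stormData, "INJURIES", n = 2)
print(top2)

# Rows for the selected events only, e.g. to highlight them in a plot.
highlighted <- stormData[stormData$EVTYPE %in% top2, ]
```

Ranking avoids retuning a numeric cutoff for each variable whenever the data change.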

First, we examine annual injuries and fatalities starting in 1996, because earlier years have few records.


library(dplyr) # pipe (%>%) and grouping verbs
library(ggplot2) # grammar of graphics
library(scales) # log10 transformation
library(RColorBrewer) # color palettes
library(egg) # par(mfrow = ...) equivalent
library(ggpubr) # common title for ggarrange

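# Events highlighted in colour: those whose mean INJURIES exceeds 80
# (the cutoff appears chosen to leave about 10 events).
stormDataFiltered <- stormData %>%
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(INJURIES) > 80)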
INJ <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, INJURIES, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, INJURIES, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, INJURIES, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Injuries') +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm'))  + 
  guides(color = guide_legend(nrow = 3, byrow = T))

stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(FATALITIES) > 14.8) 
FAT <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, FATALITIES, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, FATALITIES, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, FATALITIES, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Fatalities') +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm')) + 
  guides(color = guide_legend(nrow = 3, byrow = T))

figure <- ggarrange(INJ, FAT, nrow = 1)
annotate_figure(figure,
                top = text_grob('Top 10 Injuries and Fatalities per Year caused by Weather Events'),
                bottom = text_grob("Data source: \n NOAA Storm database", color = "blue", 
                                   hjust = 1, x = 1, face = "italic", size = 10))

We can infer the following from the graphs:

Tornado injuries and fatalities stayed close to their average over the years studied. However, in 2011 (the last year of measurement), both increased atypically.

Ice storm injuries have been declining since 1996, as have those from extreme cold. As a consequence, the corresponding fatalities are not in the top 10.

Heat-related events show high variability in injury counts and, over the years studied, an upward trend in fatalities. This rise in heat may contribute to the decline in problems associated with ice, flood, and cold.

Now, we will examine property and crop damage annually starting in 1996, since in previous years we had few records.


stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(PROPDMG) > 5000) 
PROP <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, PROPDMG, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, PROPDMG, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, PROPDMG, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Property Damage Amount') +
  #scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm'))  + 
  guides(color = guide_legend(nrow = 3, byrow = T))

stormDataFiltered <- stormData %>% 
  group_by(EVTYPE) %>%
  dplyr::filter(YEAR > 1995,
                mean(CROPDMG) > 600) 
CROP <- stormData %>%
  dplyr::filter(YEAR > 1995) %>%
  ggplot() +
  geom_point(aes(YEAR, CROPDMG, colour = EVTYPE), 
             data = stormDataFiltered, size = 1) + 
  geom_line(aes(YEAR, CROPDMG, group = EVTYPE), 
            colour = alpha("grey", 0.7), size = 0.7) +
  geom_line(aes(YEAR, CROPDMG, colour = EVTYPE), 
            data = stormDataFiltered, size = 1) +
  xlab('') + ylab('Crop Damage Amount') +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(breaks = seq(1996, 2011, 2)) +
  theme_bw() + theme(legend.position="bottom",
        legend.spacing.x = unit(0, 'cm')) + 
  guides(color = guide_legend(nrow = 3, byrow = T))

figure <- ggarrange(PROP, CROP, nrow = 1)
annotate_figure(figure,
                top = text_grob('Top 10 Property and Crop Damage Amount per Year caused by Weather Events'),
                bottom = text_grob("Data source: \n NOAA Storm database", color = "blue", 
                                   hjust = 1, x = 1, face = "italic", size = 10))

We can deduce from the graphical analysis:

Except for wildfire, all events in the top 10 for property damage show a roughly constant yearly level with low variability, which suggests some degree of control. Wildfire's high variability and strong upward trend may be related to the high heat observed in the previous graph.

For crop damage, all events appear to be under some control; that is, they vary little around a constant average over the years. The exception is river flood, which has high variability and appears to be out of control.
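The statements about variability are read off the plots. One way to quantify them is the coefficient of variation (standard deviation divided by mean) per event type. This is a sketch under that assumption, not part of the original analysis, and the values are hypothetical.

```r
# Toy yearly damage totals (hypothetical values).
stormData <- data.frame(
  YEAR = rep(1996:1999, times = 2),
  EVTYPE = rep(c("Hail", "Wildfire"), each = 4),
  PROPDMG = c(100, 110, 90, 100, 10, 400, 50, 900)
)

# Coefficient of variation of yearly damage per event type:
# a larger value means more year-to-year variability relative to the mean.
cv <- tapply(stormData$PROPDMG, stormData$EVTYPE,
             function(x) sd(x) / mean(x))
print(round(cv, 2))
```

An event with a much larger coefficient than the others would correspond to the "high variability" cases described above.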

Until next time,

Daniel Amaral