NOAA Storm Dataset Analysis

Natural disasters, such as storms and severe weather events, are a source calamity from a economic and social perspective. Analyzing the United States (U.S.) National Oceanic and Atmospheric Administration’s (NOAA) storm database, first, I inspect the evolution of the total number of servere events in U.S. over the 1950-2011 period, as well as analyze how the occurrency of the events differ across the months. After that, I analyze the most recurrent events in U.S. during the available period and their impact over the population health. Additionally, I show the most harmful events regarding economic impact in the U.S..

Data Processing:

For the analysis, I will be using data from NOAA storm database, which can be downloaded here.

Once the dataset is downloaded, we can unzip the file. As I am using a Linux machine, I set the argument unzip = /usr/bin/bunzip2 to specify the method of unziping to use in the following command:

unzip(zipfile = "repdata%2Fdata%2FStormData.csv.bz2", 
      unzip = "/usr/bin/bunzip2")

Now the the dataset is finally downloaded and unziped, we are ready to read it:

library(data.table)
library(dplyr)
foo <- fread("storm_data")
foo %>% head

##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6:       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1: TORNADO         0                                               0
## 2: TORNADO         0                                               0
## 3: TORNADO         0                                               0
## 4: TORNADO         0                                               0
## 5: TORNADO         0                                               0
## 6: TORNADO         0                                               0
##    COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1:         NA         0                      14.0   100 3   0          0
## 2:         NA         0                       2.0   150 2   0          0
## 3:         NA         0                       0.1   123 2   0          0
## 4:         NA         0                       0.0   100 2   0          0
## 5:         NA         0                       0.0   150 2   0          0
## 6:         NA         0                       1.5   177 2   0          0
##    INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1:       15    25.0          K       0                                    
## 2:        0     2.5          K       0                                    
## 3:        2    25.0          K       0                                    
## 4:        2     2.5          K       0                                    
## 5:        2     2.5          K       0                                    
## 6:        6     2.5          K       0                                    
##    LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1:     3040      8812       3051       8806              1
## 2:     3042      8755          0          0              2
## 3:     3340      8742          0          0              3
## 4:     3458      8626          0          0              4
## 5:     3412      8642          0          0              5
## 6:     3450      8748          0          0              6

Now let’s perform the data cleaning process. First, I transform the Date variable into using the lubridate package and then create the variables year and month:

library(lubridate)
foo <- foo %>% 
    mutate(BGN_DATE = mdy_hms(BGN_DATE)) %>% 
    as.data.table()

## CREATE YEAR AND MONTH VARIABLES::
foo <- foo %>% 
    mutate(year = lubridate::year(BGN_DATE),
           month = lubridate::month(BGN_DATE, label = T)) %>% 
    as.data.table()

foo %>% head

##    STATE__   BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1:       1 1950-04-18     0130       CST     97     MOBILE    AL TORNADO
## 2:       1 1950-04-18     0145       CST      3    BALDWIN    AL TORNADO
## 3:       1 1951-02-20     1600       CST     57    FAYETTE    AL TORNADO
## 4:       1 1951-06-08     0900       CST     89    MADISON    AL TORNADO
## 5:       1 1951-11-15     1500       CST     43    CULLMAN    AL TORNADO
## 6:       1 1951-11-15     2000       CST     77 LAUDERDALE    AL TORNADO
##    BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1:         0                                               0         NA
## 2:         0                                               0         NA
## 3:         0                                               0         NA
## 4:         0                                               0         NA
## 5:         0                                               0         NA
## 6:         0                                               0         NA
##    END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## 1:         0                      14.0   100 3   0          0       15
## 2:         0                       2.0   150 2   0          0        0
## 3:         0                       0.1   123 2   0          0        2
## 4:         0                       0.0   100 2   0          0        2
## 5:         0                       0.0   150 2   0          0        2
## 6:         0                       1.5   177 2   0          0        6
##    PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1:    25.0          K       0                                         3040
## 2:     2.5          K       0                                         3042
## 3:    25.0          K       0                                         3340
## 4:     2.5          K       0                                         3458
## 5:     2.5          K       0                                         3412
## 6:     2.5          K       0                                         3450
##    LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM year month
## 1:      8812       3051       8806              1 1950   Apr
## 2:      8755          0          0              2 1950   Apr
## 3:      8742          0          0              3 1951   Feb
## 4:      8626          0          0              4 1951   Jun
## 5:      8642          0          0              5 1951   Nov
## 6:      8748          0          0              6 1951   Nov

Now, I transform the characters in EVTYPE variable to lower case and remove any special character and extra spaces:

library(stringr)

## TO LOWECASE:
foo <- foo %>% 
    mutate(EVTYPE = str_to_lower(EVTYPE)) %>% 
    as.data.table()

## REMOVE SPECIAL CHARACTERS:
foo <- foo %>% 
    mutate(EVTYPE = str_replace_all(string = EVTYPE, pattern = "[[:punct:]]", replacement = " ")) %>% 
    as.data.table()

## TRIM EXTRA BLANK SPACES:
foo <- foo %>% 
    mutate(EVTYPE = trimws(x = EVTYPE, which = "both")) %>% 
    as.data.table()

The EVTYPE variable contains information for the kind of event happened. According to the Storm Data Documentation, there are 48 event type labels. Inspecting this variable closer, I notice some typos. In order to clean this variable, I try to group each classification in one of the 48 pre-classified groups.

event_types <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold Wind Chill",
  "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
  "Dust Storm",  "Excessive Heat", "Extreme Cold Wind Chill", "Flash Flood",
  "Flood", "Freezing Fog", "Frost Freeze", "Funnel Cloud", "Hail", "Heat",
  "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane Typhoon",
  "Ice Storm", "Lakeshore Flood", "Lake Effect Snow", "Lightning", "Marine Hail",
  "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current",
  "Seiche", "Sleet", "Storm Surge Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
  "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", 
  "Wildfire", "Winter Storm", "Winter Weather")


for( i in event_types){ 
    
    index <- agrep(pattern = i, x = foo$EVTYPE, ignore.case = T)
    if( length(index) == 0 ) next
    
    foo[index, EVTYPE := i]
    
}

After that, I still needed to padronize some of the classification:

foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "tstm wind", "Thunderstorm Wind", EVTYPE)) %>% 
    as.data.table()


foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "marine tstm wind", "Marine Thunderstorm Wind", EVTYPE)) %>% 
    as.data.table()

foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "wild forest fire", "Wildfire", EVTYPE)) %>% 
    as.data.table()

foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "extreme cold", "Extreme Cold Wind Chill", EVTYPE)) %>% 
    as.data.table()

foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "snow", "Heavy Snow", EVTYPE)) %>% 
    as.data.table()

foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "landslide", "Debris Flow", EVTYPE)) %>% 
    as.data.table()


foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "hurricane", "Hurricane Typhoon", EVTYPE)) %>% 
    as.data.table()


foo <- foo %>% 
    mutate(EVTYPE = ifelse(EVTYPE == "storm surge", "Storm Surge Tide", EVTYPE)) %>% 
    as.data.table()

Now I have a dataset that is closer to the 48 pre-classified groups, although there are still some labels not defined in the documentation. I choose to ignore this typos as they do not have a significant number to affect this analysis.

Now let’s pay attention to the variables CROPDMGEXP and PROPDMGEXP. As the documentation states that Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. Hence, I tranform this variables accordingly:

foo$CROPDMGEXP <- str_to_upper(foo$CROPDMGEXP)
foo$PROPDMGEXP <- str_to_upper(foo$PROPDMGEXP)

economic_impact <- foo %>% 
    mutate(CROPDMGEXP = case_when(CROPDMGEXP == "K" ~ 1000,
                                  CROPDMGEXP == "M" ~ 1000000,
                                  CROPDMGEXP == "B" ~ 1000000000,
                                  TRUE ~ 0),
           PROPDMGEXP = case_when(PROPDMGEXP == "K" ~ 1000,
                                  PROPDMGEXP == "M" ~ 1000000,
                                  PROPDMGEXP == "B" ~ 1000000000,
                                  TRUE ~ 0)) %>% 
    select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP) %>% 
    as.data.table()

Results:

Let’s take a look at the evolution of the number of severe events during the 1950-2011 period in U.S.:

storm_by_year <- foo[, .N, by = .(year)] %>% arrange(year)

library(ggplot2)
## Number of severe events by year:
ggplot(data = storm_by_year) + aes(x = year, y = N) + 
    geom_line(size = 0.5) + 
    geom_point(size = 1) + 
    theme_minimal() + 
    scale_x_continuous(breaks = seq(min(storm_by_year$year), max(storm_by_year$year), by = 10)) +
    labs(title = "Number of Severe Events (1950-2011)", 
         y = "Occurrency", x = "Year") + 
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
          axis.text = element_text(size = 10),
          axis.title = element_text(size = 12))

As we can see from the previous plot, the number of registered severe events surge after 1990. This is an interesting evidence, however we can not state that the from 1990 we observe more events. It states that the number of registered servere events is more pronounced during this period.

Now let’s investigate how these events spread across the months.

storm_by_month <- foo[, .N, by = .(month)] %>% arrange(month)

storm_by_month <- storm_by_month %>% 
    mutate(month = factor(month, levels = month)) %>% 
    as.data.table()

ggplot(data = storm_by_month) + aes(x = month, y = N) + 
    geom_bar(stat = "identity", fill = "red") +
    labs(title = "Severe Events by Month (1950-2011)", 
         y = "Occurrency", x = "Year") + 
    theme_minimal() + 
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
          axis.text = element_text(size = 10),
          axis.title = element_text(size = 12))

The histogram shows that the majority of the events concentrate among the April and August, which means that during these months the frequency to face this events is higher.

Once, I show the number of registered events has been growing during the 1950-2011 period and these events concentrate during a particular months, one may ask what is the most recurrent events in the country. The following table can give as the answer:

library(kableExtra)
top_10_event <- foo %>% 
    group_by(EVTYPE) %>% 
    summarise(N = n()) %>% 
    arrange(-N) %>% 
    as.data.table() %>% 
    head(10)

top_10_event %>% 
    kable() %>% 
    kable_styling(bootstrap_options = "striped", full_width = F)

EVTYPE	N
Thunderstorm Wind	329330
Hail	292207
Flood	82734
Tornado	60699
Heat	38755
High Wind	21923
Lightning	15777
Winter Storm	11436
Funnel Cloud	6938
Marine Thunderstorm Wind	6175

The previous table shows the top 10 most recurrent severe events in U.S. As we can see, Thunderstorm Wind is the most recurrent event, followed by Hail and Flood. But, how harmful are these events for the country?

The following figure shows the most harmful events for population health. Here, I use the number of fatalities and injuries as a proxy for population health.

top_10_harmful <- foo %>% 
    group_by(EVTYPE) %>% 
    summarise(fatalities = sum(FATALITIES + INJURIES)) %>% 
    arrange(-fatalities) %>% 
    as.data.table() %>% 
    head(10)

top_10_harmful <- top_10_harmful %>% 
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>% 
    as.data.table()

levels(top_10_harmful$EVTYPE) <- gsub(pattern = " ", replacement = "\n", x = levels(top_10_harmful$EVTYPE))

ggplot(data = top_10_harmful) + 
    aes(x = as.factor(EVTYPE), y = fatalities/1000) + 
    geom_bar(stat = "identity", width = .8, fill = "red") + 
    theme_minimal() + 
    labs(y = "Number of Fatalities and Injuries", x = "Event Type",
         title = "Top 10 most harmful events across U.S. (in thousands) ") +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
          axis.text = element_text(size = 10),
          axis.title = element_text(size = 12))

As we can see, Tornado is the event that cause more fatalities and/or injuries in America, followed by Heat and Flood. Also, we can notice that 80% of the most harmful events are among the top 10 most recurrent events in the country. That is, these events tend to occur more frequently and cause more damages to the population health.

Now, let’s investigate the severe events that have the most economic impact across US. Here, I use the variables CROPDMG and PROPDMG as proxies for economic impact, as these variables describe Crop and property damage, respectively. Then, I multiply these column by CROPDMGEXP and PROPDMGEXP, which are the representative scale. Finally I add up by eventtypes. The following figure shows the results:

economic_impact <- economic_impact %>% 
    mutate(CROPDMG = CROPDMG * CROPDMGEXP,
           PROPDMG = PROPDMG * PROPDMGEXP) %>% 
    select(EVTYPE, PROPDMG, CROPDMG) %>% 
    as.data.table()


economic_impact <- economic_impact[, .(economic_impact = sum(PROPDMG + CROPDMG)), by = .(EVTYPE)] %>% 
    arrange(-economic_impact, EVTYPE) %>% head(10)
    

economic_impact <- economic_impact %>% 
    arrange(economic_impact) %>% 
    mutate(economic_impact = economic_impact/1000000000,
           EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>% 
    as.data.table()

ggplot(data = economic_impact) + 
    aes(x = EVTYPE, y = economic_impact) + 
    geom_bar(stat = "identity", fill = "red") + 
    coord_flip() + 
    theme_minimal() + 
    labs(y = "", x = "Event Type",
         title = "Economic damage caused by severe events 1950-2011 (in billion U.S. dollars)
",
         subtitle = "") +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
          axis.text = element_text(size = 10),
          axis.title = element_text(size = 12))

From previous figure, we can that Flood, Hurricane/typhons and Tornado cause the most economic impact in the country during the period. Additionally, we notice that 4 of the most economic damage events are among the top 10 most recurrrent and poulation health damage.

Conclusion

Natural disasters, such as storms and severe weather events, are a source calamity from a economic and social perspective. In this small exercise, I analyze the United States (U.S.) National Oceanic and Atmospheric Administration’s (NOAA) storm database. The results show that the number of registered severe events increased over the year, with a surge after the 1990. Also, within years, the events tend to concentrate between April and August. Finally, I showed that almost half of the events that have the most economic and population health damages in the country are among the most recurrent events during 1950-2011 period.

NOAA Storm Dataset Analysis

Cleyton Farias

Data Processing:

Results:

Conclusion