storm_data_analysis.utf8

Top Severe Weather Phenomena in the United States (1996-2011)

Synopsis

In this simple analysis we aim to identify which meteorological phenomenon is the most detrimental to population health and which has the most impact with regards to economic repercussions across the United States from 1996 to 2011. To do such, we obtained data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as collected by the National Weather Service(NWS). Specifically, we looked at fatalities and injuries as significant factors that are harmful to population health, whereas property and crop damage was used to discern event/s that have the greatest economic consequences.

Data Processing

We obtained data from version 3 of NOAA’s storm events database which contains consolidated weather events from 1950 to 2011.

Reading the data

We read in the data from a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

> library(magrittr)
> library(data.table)
> library(tidyverse)
> library(ggthemes)
> library(RColorBrewer)
> 
> 
> url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
> df <- fread(url)

After reading in, we check the dimensions of our data table.

> dim(df)

[1] 902297     37

Apparently, the data has 902297 rows and 37 columns. Note that each observation refers to a specific storm episode with each column describing specific details about the event. As we can see, working with this much data could easily obscure our analysis so we first check out the column labels.

> colnames(df)

 [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
 [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
[11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
[21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
[26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
[31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS"    "REFNUM"

Information about these variables can be accessed here. Moreover, details about the accuracy of the data are discussed here. Also, the data contains information from 1950 to 2011, but we will include only data from 1996 to 2011 for our analysis because of the following statements:

`` 1. Tornado: From 1950 through 1954, only tornado events were recorded.

Tornado, Thunderstorm Wind and Hail: From 1955 through 1992, only tornado, thunderstorm wind and hail events were keyed from the paper publications into digital data. From 1993 to 1995, only tornado, thunderstorm wind and hail events have been extracted from the Unformatted Text Files.

All Event Types (48 from Directive 10-1605): From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.``

Source: https://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype

Specifically, we will look at these columns;

Variable	Description
BGN_DATE	The date at which the weather event started
EVTYPE	The type of weather event (e.g. Hail, Thunderstorm Wind, etc)
FATALITIES	The number of deaths related to the weather event
INJURIES	The number of injuries related to the weather event
PROPDMG	The estimated amount of damage to property incurred by the weather event (e.g. 10.00K = $10,000 ; 10.00M = $10,000,000)
PROPDMGEXP	Unit of measure or multiplier for PROPDMG value (e.g. K = $1,000; M = $1,000,000)
CROPDMG	The estimated amount of damage to crops incurred by the weather event (e.g. 10.00K = $10,000; 10.00M = $10,000,000)
CROPDMGEXP	Unit of measure or multiplier for CROPDMG value (e.g. K = $1,000; M = $1,000,000)

Transforming the data

We filter the data using only the aforementioned variables. In addition, we exclude the data that have zero values on all of these variables since we are only interested at the event that is most detrimental to health and economy.

> storm_data <- df %>%
+   select(
+     BGN_DATE, EVTYPE, FATALITIES, INJURIES,
+     PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
+   ) %>%
+   filter(
+     FATALITIES != 0 | INJURIES != 0 | PROPDMG != 0 | CROPDMG != 0
+   )
> 
> dim(storm_data)

[1] 254633      8

The new data table gives us 254633 rows and 8 columns. We look at the first few rows of the dataset.

> storm_data <- as_tibble(storm_data)
> head(storm_data)

# A tibble: 6 x 8
  BGN_DATE      EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
  <chr>         <chr>       <dbl>    <dbl>   <dbl> <chr>        <dbl> <chr>     
1 4/18/1950 0:~ TORNA~          0       15    25   K                0 ""        
2 4/18/1950 0:~ TORNA~          0        0     2.5 K                0 ""        
3 2/20/1951 0:~ TORNA~          0        2    25   K                0 ""        
4 6/8/1951 0:0~ TORNA~          0        2     2.5 K                0 ""        
5 11/15/1951 0~ TORNA~          0        2     2.5 K                0 ""        
6 11/15/1951 0~ TORNA~          0        6     2.5 K                0 ""

> head(storm_data$BGN_DATE)

[1] "4/18/1950 0:00:00"  "4/18/1950 0:00:00"  "2/20/1951 0:00:00" 
[4] "6/8/1951 0:00:00"   "11/15/1951 0:00:00" "11/15/1951 0:00:00"

We convert BGN_DATE to datetime object and then extract the year. We filter the data for the years 1996-2011.

> storm_data$BGN_DATE <-
+   gsub(
+     pattern = "(.*)\\s(.*)",
+     replacement = "\\1",
+     storm_data$BGN_DATE
+   ) %>%
+   as.Date(format = "%m/%d/%Y")
> 
> storm_data <-
+   storm_data %>%
+   filter(year(BGN_DATE) %in% 1996:2011)
> 
> storm_data$BGN_DATE %>%
+   year() %>%
+   unique() %>%
+   sort()

 [1] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[16] 2011

The next step is to merge PROPDMG with their corresponding unit/multiplier, PROPDMGEXP, as well as CROPDMG with CROPDMGEXP. Here we will explore the contents of PROPDMGEXP and CROPDMGEXP variables.

> unique(storm_data$PROPDMGEXP)

[1] "K" ""  "M" "B"

> unique(storm_data$CROPDMGEXP)

[1] "K" ""  "M" "B"

An effort was made to determine what these values imply at which more details can be found here. With that, we can assure the validity of the following association:

H,h = hundreds = 100
K,k = kilos = thousands = 1,000
M,m = millions = 1,000,000
B,b = billions = 1,000,000,000
black/empty character = 0

We then change these values with their corresponding actual value.

> storm_data$PROPDMGEXP <-
+   case_when(
+     grepl("[Hh]", storm_data$PROPDMGEXP) ~ 10^2,
+     grepl("[Kk]", storm_data$PROPDMGEXP) ~ 10^3,
+     grepl("[Mm]", storm_data$PROPDMGEXP) ~ 10^6,
+     grepl("[Bb]", storm_data$PROPDMGEXP) ~ 10^9,
+     grepl("(^$)", storm_data$PROPDMGEXP) ~ 0,
+   )
> 
> unique(storm_data$PROPDMGEXP)

[1] 1e+03 0e+00 1e+06 1e+09

> storm_data$CROPDMGEXP <-
+   case_when(
+     grepl("[Kk]", storm_data$CROPDMGEXP) ~ 10^3,
+     grepl("[Mm]", storm_data$CROPDMGEXP) ~ 10^6,
+     grepl("[Bb]", storm_data$CROPDMGEXP) ~ 10^9,
+     grepl("(^$)", storm_data$CROPDMGEXP) ~ 0
+   )
> 
> unique(storm_data$CROPDMGEXP)

[1] 1e+03 0e+00 1e+06 1e+09

We now use PROPDMGEXP and CROPDMGEXP as multipliers for the values of property damages (PROPDMG) and crop damages (CROPDMG) respectively.

> storm_data <-
+   storm_data %>%
+   rename_with(tolower) %>%
+   mutate(
+     bgn_date = year(bgn_date),
+     propdmg = propdmg * as.double(propdmgexp),
+     cropdmg = cropdmg * as.double(cropdmgexp),
+   ) %>%
+   select(
+     year = bgn_date,
+     evtype, fatalities, injuries, propdmg, cropdmg
+   ) %>%
+   arrange(year)
> 
> storm_data

# A tibble: 201,318 x 6
    year evtype       fatalities injuries propdmg cropdmg
   <int> <chr>             <dbl>    <dbl>   <dbl>   <dbl>
 1  1996 WINTER STORM          0        0  380000   38000
 2  1996 TORNADO               0        0  100000       0
 3  1996 TSTM WIND             0        0    3000       0
 4  1996 TSTM WIND             0        0    5000       0
 5  1996 TSTM WIND             0        0    2000       0
 6  1996 HIGH WIND             0        0  400000       0
 7  1996 TSTM WIND             0        0   12000       0
 8  1996 TSTM WIND             0        0    8000       0
 9  1996 TSTM WIND             0        0   12000       0
10  1996 FLASH FLOOD           0        0   75000       0
# ... with 201,308 more rows

We can further tidy the data by fixing the discrepancies(e.g. typo errors) on the evtype column. Also, the number of official event types are only 48 however we have 222 unique events on this column.

> length(unique(storm_data$evtype))

[1] 222

Going through each of the event and manually fact-checking each on the database is one way to fix this column albeit a tedious task. Thus, for the purpose of this analysis we will only look at some patterns for the 20 most frequent events.

> table(storm_data$evtype) %>%
+   sort(decreasing = TRUE) %>%
+   head(n = 20)


           TSTM WIND    THUNDERSTORM WIND                 HAIL 
               61775                43097                22679 
         FLASH FLOOD              TORNADO            LIGHTNING 
               19011                12366                11152 
               FLOOD            HIGH WIND          STRONG WIND 
                9513                 5402                 3367 
        WINTER STORM           HEAVY RAIN           HEAVY SNOW 
                1460                 1047                 1029 
            WILDFIRE URBAN/SML STREAM FLD       EXCESSIVE HEAT 
                 847                  702                  685 
           ICE STORM       TSTM WIND/HAIL       TROPICAL STORM 
                 631                  441                  410 
      WINTER WEATHER     WILD/FOREST FIRE 
                 405                  381

After matching these events with the information on the database, it turns out that the following are exactly the same as the value on the right:

TSTM WIND -> THUNDERSTORM WIND
TSTM WIND/HAIL -> THUNDERSTORM WIND
THUNDERSTORM WINDS -> THUNDERSTORM WIND
HIGH WINDS -> HIGH WIND
URBAN/SML STREAM FLD -> FLOOD

We then alter these values. Moreover, we convert each element to title case and remove leading and trailing whitespace.

> storm_data$evtype <-
+   storm_data$evtype %>%
+   trimws() %>%
+   str_to_title() %<>%
+   str_replace("(.*)\\s\\W*\\d+\\W*", "\\1") %<>%
+   str_replace("(.*)\\s\\W*\\w{1}\\d+\\W*", "\\1") %>%
+   trimws() %<>%
+   gsub(
+     pattern = "Tstm Wind|Tstm Winds|Tstmw|Tstm Wind/Hail|Thunderstorm Winds",
+     replacement = "Thunderstorm Wind"
+   ) %<>%
+   gsub(
+     pattern = "Thunderstorm  Winds|Marine Tstm Wind|Thunderstorm Windss",
+     replacement = "Thunderstorm Wind"
+   ) %<>%
+   gsub(
+     pattern = "Thunderstorm Winds/Hail|Thunderstorms Winds",
+     replacement = "Thunderstorm Wind"
+   ) %<>%
+   gsub(
+     pattern = "Thunderstorm Wind/Hail|Thunderstorm Winds Hail",
+     replacement = "Thunderstorm Wind"
+   ) %<>%
+   gsub(
+     pattern = "High Winds", replacement = "High Wind"
+   ) %<>%
+   gsub(
+     pattern = "Urban/Sml Stream Fld|Flooding", replacement = "Flood"
+   ) %<>%
+   gsub(
+     pattern = "Flash Flooding", replacement = "Flash Flood"
+   ) %<>%
+   gsub(
+     pattern = "^.*/(Excessive Heat)", replacement = "\\1"
+   )
> 
> table(storm_data$evtype) %>%
+   sort(decreasing = TRUE) %>%
+   head(n = 20)


Thunderstorm Wind              Hail       Flash Flood           Tornado 
           105369             22679             19012             12366 
        Lightning             Flood         High Wind       Strong Wind 
            11152             10215              5405              3369 
     Winter Storm        Heavy Rain        Heavy Snow          Wildfire 
             1460              1047              1029               847 
   Excessive Heat         Ice Storm    Tropical Storm    Winter Weather 
              685               631               410               405 
 Wild/Forest Fire       Rip Current         Avalanche           Drought 
              381               364               264               258

Results

We divide our exploration into two parts. First, we look at the variables, fatalities and injuries to determine which event has the biggest impact on population health and then investigate propdmg and cropdmg with regards to economic loss.

Most Harmful on Population Health

We summarize the total fatalities and total injuries per weather event, sort the data in a descending order and then collect only the top five most harmful events.

> bad_on_pop_health_top5 <-
+   storm_data %>%
+   group_by(evtype) %>%
+   summarise(
+     total_deaths = sum(fatalities),
+     total_injuries = sum(injuries)
+   ) %>%
+   arrange(desc(total_deaths)) %>%
+   top_n(5)
> 
> bad_on_pop_health_top5

# A tibble: 5 x 3
  evtype            total_deaths total_injuries
  <chr>                    <dbl>          <dbl>
1 Excessive Heat            1797           6391
2 Tornado                   1511          20667
3 Lightning                  651           4141
4 Flood                      442           6837
5 Thunderstorm Wind          378           5128

It turns out that Excessive Heat events have the most fatalities with Tornado events close in second, whereas Tornado events have inflicted the most injuries.

We inspect fatalities and injuries columns graphically.

> bad_on_pop_health_top5 %>%
+   pivot_longer(-evtype) %>%
+   ggplot(aes(
+     x = reorder(evtype, value),
+     y = value,
+     fill = evtype
+   )) +
+   geom_bar(stat = "identity") +
+   theme_economist() +
+   scale_fill_brewer(palette = "Spectral") +
+   theme(legend.position = "none") +
+   labs(
+     x = "", y = "",
+     title = "Top 5 Weather Events (Injuries/Loss of life)"
+   ) +
+   facet_wrap(name ~ .,
+     ncol = 1,
+     scales = "free",
+     labeller = labeller(
+       name = c(
+         `total_injuries` = "Total Number of Injuries",
+         `total_deaths` = "Total Number of Deaths"
+       )
+     ),
+     strip.position = "bottom"
+   ) +
+   coord_flip()

For the purpose of this simple analysis, we can assume that Tornado is the most detrimental weather event to population health with respect to both the number of injuries and deaths inflicted.

Most Impact on Economy

We summarize the total crop damage and total property damage per weather event, sort the data in a descending order and then collect only the top five weather events.

> bad_on_economy_top5 <-
+   storm_data %>%
+   group_by(evtype) %>%
+   summarise(
+     total_crop_damage = sum(cropdmg),
+     total_property_damage = sum(propdmg)
+   ) %>%
+   arrange(desc(total_crop_damage)) %>%
+   top_n(5)
> 
> bad_on_economy_top5

# A tibble: 5 x 3
  evtype            total_crop_damage total_property_damage
  <chr>                         <dbl>                 <dbl>
1 Flood                    4983266500          144003143200
2 Hurricane/Typhoon        2607872800           69305840000
3 Flash Flood              1334901700           15222253910
4 Tornado                   283425010           24616945710
5 Storm Surge                    5000           43193536000

We then visually inspect these events.

> bad_on_economy_top5 %>%
+   pivot_longer(-evtype) %>%
+   ggplot(aes(
+     x = reorder(evtype, value),
+     y = value,
+     fill = evtype
+   )) +
+   geom_bar(stat = "identity") +
+   scale_y_continuous(trans = "log10",
+                      labels = c("100", "100K", "100M", "100B"),
+                      breaks = c(10^2, 10^5, 10^8, 10^11)) +
+   theme_economist() +
+   scale_fill_brewer(palette = "Spectral") +
+   theme(legend.position = "none") +
+   labs(
+     x = "", y = "",
+     title = "Top 5 Weather Events (Loss of Crops and Properties)"
+   ) +
+   facet_wrap(name ~ .,
+     ncol = 1,
+     labeller = labeller(
+       name = c(
+         `total_crop_damage` = "Total Amount of Damage to Crops",
+         `total_property_damage` = "Total Amount of Damage to Properties"
+       )
+     ),
+     strip.position = "bottom"
+   ) +
+   coord_flip()

Considering both crop and property damage, we can say that Flood events have been the most devastating.

Summary

Through exploratory data analysis, we were able to identify weather phenomena that have tremendous impact on population health and on economy. Tornado has been identified as the most harmful to population health in the U.S. from 1996 to 2011 with respect to both injuries and loss of life. On the other hand, Flood has been the most catastrophic with regards to economic loss due to crop and property damages.