Synopsis

In this report, we analyze the impact of different weather events using the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), which covers 1950 to 2011. We use the estimates of fatalities, injuries, property damage, and crop damage to decide which types of event are most harmful to population health and the economy. From these data, we found that EXCESSIVE HEAT and TORNADO are most harmful with respect to population health, while FLOOD and DROUGHT have the greatest economic consequences.

Data Processing

We obtained the Storm Data file from the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). The events in the database start in the year 1950 and end in November 2011.

library(R.utils)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringdist)
library(cowplot)
rawfile <- 'repdata_data_StormData.csv.bz2'
if (!file.exists(rawfile)) {
    url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
    download.file(url, destfile = rawfile, method = 'curl')
}
Data <- fread(rawfile) %>% tibble::as_tibble() 
Data
## # A tibble: 902,297 x 37
##    STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##      <dbl> <chr>    <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
##  1       1 4/18/19~ 0130     CST           97 MOBILE     AL    TORNA~         0
##  2       1 4/18/19~ 0145     CST            3 BALDWIN    AL    TORNA~         0
##  3       1 2/20/19~ 1600     CST           57 FAYETTE    AL    TORNA~         0
##  4       1 6/8/195~ 0900     CST           89 MADISON    AL    TORNA~         0
##  5       1 11/15/1~ 1500     CST           43 CULLMAN    AL    TORNA~         0
##  6       1 11/15/1~ 2000     CST           77 LAUDERDALE AL    TORNA~         0
##  7       1 11/16/1~ 0100     CST            9 BLOUNT     AL    TORNA~         0
##  8       1 1/22/19~ 0900     CST          123 TALLAPOOSA AL    TORNA~         0
##  9       1 2/13/19~ 2000     CST          125 TUSCALOOSA AL    TORNA~         0
## 10       1 2/13/19~ 2000     CST           57 FAYETTE    AL    TORNA~         0
## # ... with 902,287 more rows, and 28 more variables: BGN_AZI <chr>,
## #   BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>, COUNTY_END <dbl>,
## #   COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>,
## #   LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## #   INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>,
## #   CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>,
## #   LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## #   REMARKS <chr>, REFNUM <dbl>

Cleaning Storm Event Types

Below is the full list of official event types in the Storm Events Database. We use the amatch function from the stringdist package for approximate string matching, with maxDist = 5, to map each EVTYPE in the dataset to an official event type.

EventType = c(
    'ASTRONOMICAL LOW TIDE', 'AVALANCHE', 'BLIZZARD', 'COASTAL FLOOD',
    'COLD/WIND CHILL', 'DEBRIS FLOW', 'DENSE FOG', 'DENSE SMOKE',
    'DROUGHT', 'DUST DEVIL', 'DUST STORM', 'EXCESSIVE HEAT',
    'EXTREME COLD/WIND CHILL', 'FLASH FLOOD', 'FLOOD', 'FREEZING FOG',
    'FROST/FREEZE', 'FUNNEL CLOUD', 'HAIL', 'HEAT',
    'HEAVY RAIN', 'HEAVY SNOW', 'HIGH SURF', 'HIGH WIND',
    'HURRICANE (TYPHOON)', 'ICE STORM', 'LAKE-EFFECT SNOW', 'LAKESHORE FLOOD',
    'LIGHTNING', 'MARINE HAIL', 'MARINE HIGH WIND', 'MARINE STRONG WIND',
    'MARINE THUNDERSTORM WIND', 'RIP CURRENT', 'SEICHE', 'SLEET',
    'SNEAKERWAVE', 'STORM SURGE/TIDE', 'STRONG WIND', 'THUNDERSTORM WIND',
    'TORNADO', 'TROPICAL DEPRESSION', 'TROPICAL STORM', 'TSUNAMI',
    'VOLCANIC ASH', 'WATERSPOUT', 'WILDFIRE', 'WINTER STORM', 'WINTER WEATHER'
)
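StormData <- Data %>%
    mutate(
        # Map each raw label to the closest official type (NA if none within maxDist)
        EventType = EventType[amatch(toupper(EVTYPE), EventType, maxDist = 5)],
        # Extract the first four-digit run in BGN_DATE as the year
        Year = as.numeric(regmatches(BGN_DATE, regexpr('(\\d{4})', BGN_DATE)))
    )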
mean(is.na(StormData$EventType))
## [1] 0.01258566

Because the proportion of records with no matching official event type is very low (about 1.26%), we drop those records from the analysis.

StormData <- StormData %>% drop_na(EventType)
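
To make the approximate matching step concrete, here is a minimal sketch on a few made-up labels. The three-entry `official` table and the `raw` strings are hypothetical and chosen only for illustration; amatch is called with its default distance method, as in the code above.

```r
library(stringdist)

# Hypothetical mini-table of official names (not the full NOAA list)
official <- c('FLOOD', 'HAIL', 'TORNADO')

# Hypothetical raw labels: a plural, a transposition, a label with a
# trailing code, a free-text summary line, and a type absent from the table
raw <- c('FLOODS', 'TORNDAO', 'HAIL 075', 'SUMMARY OF MAY 22', 'HEAT')

# amatch() returns the index of the closest table entry within maxDist,
# or NA when no entry is close enough
matched <- official[amatch(raw, official, maxDist = 5)]
data.frame(raw, matched)
```

In this toy table, the summary line is left unmatched (NA), while 'HEAT' is paired with 'HAIL' because its own type is missing from the table. A generous maxDist can therefore also produce unrelated pairings, so the matched values are worth spot-checking.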

Event Types Available

According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and only from January 1996 onward were all event types recorded.

StormDataByType <- StormData %>%
    group_by(Year, EventType) %>%
    summarise(Count = n(), .groups = 'drop')
ggplot(data = StormDataByType, mapping = aes(x = Year, color = EventType)) +
    geom_line(aes(y = Count, group = EventType)) + 
    geom_vline(xintercept = 1996, linetype = 'dashed') +
    scale_x_continuous(n.breaks = 20) +
    ylab('Number of Events') +
    ggtitle('Total Number of Weather Events in the U.S. 1950 - 2011') +
    theme(legend.position = 'none')

Based on the above figure, we consider only data from 1996 to 2011, when records cover all event types. This leaves 644063 observations.

StormData <- StormData %>% filter(Year > 1995)
nrow(StormData)
## [1] 644063
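
The year filter relies on pulling a four-digit year out of the BGN_DATE strings. The sketch below shows that step in isolation with base R; the date strings are hypothetical and assume a month/day/year layout.

```r
# Hypothetical date strings in an assumed month/day/year layout
bgn_date <- c('4/18/1950 0:00:00', '1/6/1996 0:00:00', '11/28/2011 0:00:00')

# regexpr() locates the first run of four digits; regmatches() extracts it
year <- as.numeric(regmatches(bgn_date, regexpr('\\d{4}', bgn_date)))

# Keep only events from 1996 onward, as in the filter above
keep <- year > 1995
bgn_date[keep]
```

Note that regmatches() drops elements with no match, so the result would be shorter than the input if any date lacked a four-digit year.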

Impact on Public Health

In this section, we total the fatalities and injuries caused by severe weather events. We want to identify the 15 most harmful types of weather event.

Fatalities <- StormData %>%
    select(EventType, FATALITIES) %>%
    group_by(EventType) %>%
    summarise(TOTAL = sum(FATALITIES, na.rm = TRUE), .groups = 'drop') %>%
    arrange(desc(TOTAL))
summary(Fatalities)
##   EventType             TOTAL       
##  Length:47          Min.   :   0.0  
##  Class :character   1st Qu.:   1.5  
##  Mode  :character   Median :  33.0  
##                     Mean   : 177.5  
##                     3rd Qu.: 121.5  
##                     Max.   :1797.0
Injuries <- StormData %>%
    select(EventType, INJURIES) %>%
    group_by(EventType) %>%
    summarise(TOTAL = sum(INJURIES, na.rm = TRUE), .groups = 'drop') %>%
    arrange(desc(TOTAL))
summary(Injuries)
##   EventType             TOTAL        
##  Length:47          Min.   :    0.0  
##  Class :character   1st Qu.:    3.0  
##  Mode  :character   Median :  143.0  
##                     Mean   : 1207.1  
##                     3rd Qu.:  726.5  
##                     Max.   :20667.0
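
The aggregation pattern used for both tables (sum per event type, sort in descending order, then take the top rows) can be sketched on a small hypothetical data set. The `toy` values below are invented for illustration only.

```r
library(dplyr)

# Hypothetical records; one missing count shows the effect of na.rm = TRUE
toy <- tibble::tibble(
    EventType  = c('TORNADO', 'HEAT', 'TORNADO', 'FLOOD', 'HEAT'),
    FATALITIES = c(3, 10, 2, 1, NA)
)

# Total per event type, largest first, keeping the top 2
top <- toy %>%
    group_by(EventType) %>%
    summarise(TOTAL = sum(FATALITIES, na.rm = TRUE), .groups = 'drop') %>%
    arrange(desc(TOTAL)) %>%
    head(2)
top
```

Because the table is already sorted, head() is enough to select the most harmful types for plotting.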

Impact on Economy

We create a data frame EXPVALUE to store the multipliers: B or b = billion, M or m = million, K or k = thousand, H or h = hundred, + = 1, and each of the digits 0 to 8 = ten.

EXPVALUE = data.frame(
    EXP = c('H', 'h', 'K', 'k', 'M', 'm', 'B', 'b', '+', 0:8),
    COEFF = c(1e2, 1e2, 1e3, 1e3, 1e6, 1e6, 1e9, 1e9, 1, rep(10, 9))
)

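CROPDMGEXP holds the exponent code for CROPDMG (crop damage). In the same way, PROPDMGEXP holds the exponent code for PROPDMG (property damage). We join StormData with EXPVALUE to get the multipliers and hence the final dollar amounts. Records whose exponent code does not appear in EXPVALUE receive an NA multiplier and are excluded from the totals by na.rm = TRUE.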

PropertyDamage <- StormData %>%
    left_join(EXPVALUE %>% select(PROPDMGEXP = EXP, COEFF)) %>%
    mutate(DAMAGE = PROPDMG * COEFF) %>%
    select(EventType, DAMAGE) %>%
    group_by(EventType) %>%
    summarise(DAMAGE = sum(DAMAGE, na.rm = TRUE), .groups = 'drop') %>%
    arrange(desc(DAMAGE))
## Joining, by = "PROPDMGEXP"
summary(PropertyDamage)
##   EventType             DAMAGE         
##  Length:47          Min.   :7.400e+04  
##  Class :character   1st Qu.:1.864e+06  
##  Mode  :character   Median :1.048e+07  
##                     Mean   :7.475e+09  
##                     3rd Qu.:1.289e+09  
##                     Max.   :1.446e+11
CropDamage <- StormData %>%
    left_join(EXPVALUE %>% select(CROPDMGEXP = EXP, COEFF)) %>%
    mutate(DAMAGE = CROPDMG * COEFF) %>%
    select(EventType, DAMAGE) %>%
    group_by(EventType) %>%
    summarise(DAMAGE = sum(DAMAGE, na.rm = TRUE), .groups = 'drop') %>%
    arrange(desc(DAMAGE))
## Joining, by = "CROPDMGEXP"
summary(CropDamage)
##   EventType             DAMAGE         
##  Length:47          Min.   :0.000e+00  
##  Class :character   1st Qu.:0.000e+00  
##  Mode  :character   Median :1.211e+06  
##                     Mean   :6.446e+08  
##                     3rd Qu.:2.894e+08  
##                     Max.   :1.337e+10
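
As a worked illustration of the multiplier lookup, the sketch below applies the same EXPVALUE convention to four hypothetical records. It uses base R match() in place of left_join so that it runs without extra packages; the `toy` values are invented.

```r
# Multiplier table following the convention stated in this report
EXPVALUE <- data.frame(
    EXP = c('H', 'h', 'K', 'k', 'M', 'm', 'B', 'b', '+', 0:8),
    COEFF = c(1e2, 1e2, 1e3, 1e3, 1e6, 1e6, 1e9, 1e9, 1, rep(10, 9))
)

# Hypothetical records, including a blank exponent code
toy <- data.frame(
    PROPDMG    = c(25, 2.5, 10, 3),
    PROPDMGEXP = c('K', 'M', '', '5')
)

# Look up each code; codes not in the table (here the blank) give NA
toy$COEFF  <- EXPVALUE$COEFF[match(toy$PROPDMGEXP, EXPVALUE$EXP)]
toy$DAMAGE <- toy$PROPDMG * toy$COEFF
toy

# NA amounts are skipped when totalling
sum(toy$DAMAGE, na.rm = TRUE)
```

Here 25 with code K becomes 25,000 and 2.5 with code M becomes 2,500,000, while the record with a blank code contributes nothing to the total.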

Results

To show which types of events are most harmful with respect to population health, we make bar plots of the 15 event types that caused the largest total fatalities and total injuries.

FatalitiesFigure <-
    ggplot(data = head(Fatalities, 15), mapping = aes(x = reorder(EventType, -TOTAL), y = TOTAL)) +
    geom_bar(stat = 'identity', mapping = aes(fill = EventType)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = 'Severe Weather Type', y = 'Number of Fatalities') +
    ggtitle('Total Fatalities by Severe Weather Events\n in the U.S. from 1996 - 2011') +
    theme(legend.position = 'none')
InjuriesFigure <-
    ggplot(data = head(Injuries, 15), mapping = aes(x = reorder(EventType, -TOTAL), y = TOTAL)) +
    geom_bar(stat = 'identity', mapping = aes(fill = EventType)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = 'Severe Weather Type', y = 'Number of Injuries') +
    ggtitle('Total Injuries by Severe Weather Events\n in the U.S. from 1996 - 2011') +
    theme(legend.position = 'none')
plot_grid(FatalitiesFigure, InjuriesFigure, ncol = 2)

Based on the above plots, we find that EXCESSIVE HEAT and TORNADO caused the most fatalities, and TORNADO caused the most injuries, in the United States from 1996 to 2011.

As for the economic consequences, the two plots below rank event types by the dollar amount of property damage and crop damage.

PropertyFigure <-
    ggplot(data = head(PropertyDamage, 15), mapping = aes(x = reorder(EventType, -DAMAGE), y = DAMAGE / 1e9)) +
    geom_bar(stat = 'identity', mapping = aes(fill = EventType)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = 'Severe Weather Type', y = 'Property Damage (in US billion dollars)') +
    ggtitle('Total Property Damage by Severe Weather Events\n in the U.S. from 1996 - 2011') +
    theme(legend.position = 'none')
CropFigure <-
    ggplot(data = head(CropDamage, 15), mapping = aes(x = reorder(EventType, -DAMAGE), y = DAMAGE / 1e9)) +
    geom_bar(stat = 'identity', mapping = aes(fill = EventType)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = 'Severe Weather Type', y = 'Crop Damage (in US billion dollars)') +
    ggtitle('Total Crop Damage by Severe Weather Events\n in the U.S. from 1996 - 2011') +
    theme(legend.position = 'none')
plot_grid(PropertyFigure, CropFigure, ncol = 2)

Based on the above plots, we find that FLOOD caused the most property damage, and DROUGHT caused the most crop damage, in the United States from 1996 to 2011.