Synopsis

Using data from the U.S. National Oceanic and Atmospheric Administration (NOAA), the following data analysis determines which weather events have been the most harmful to the US population from an economic and health perspective.

Setup

## Load packages
library(tidyverse)
library(ggplot2)

Data Processing

First, we need to load the data into R.

## Load data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dir <- "/Users/kevinroche22/RData/SevereWeatherAnalysis"
setwd(dir)
download.file(url, "StormData.csv.bz2")
stormData <- read.csv(bzfile("StormData.csv.bz2"), header = TRUE, sep = ",")

Let’s take a look at what we’re working with.

## Check Data
dim(stormData)

## [1] 902297     37

head(stormData, 3)

##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3

So, the data contains just over 900k observations of 37 different variables. It looks like some of the variables have missing data. Let’s see how many missing values each variable has.

## View proportions
stormData %>% summarise_each(funs(100*mean(is.na(.))))

##   STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## 1       0        0        0         0      0          0     0      0         0
##   BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## 1       0          0        0        0          0        100         0       0
##   END_LOCATI LENGTH WIDTH        F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1          0      0     0 93.49061   0          0        0       0          0
##   CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES    LATITUDE LONGITUDE LATITUDE_E
## 1       0          0   0          0         0 0.005208928         0 0.00443313
##   LONGITUDE_ REMARKS REFNUM
## 1          0       0      0

## View totals
stormData %>% summarise_each(funs(sum(is.na(.))))

##   STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## 1       0        0        0         0      0          0     0      0         0
##   BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## 1       0          0        0        0          0     902297         0       0
##   END_LOCATI LENGTH WIDTH      F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1          0      0     0 843563   0          0        0       0          0
##   CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E
## 1       0          0   0          0         0       47         0         40
##   LONGITUDE_ REMARKS REFNUM
## 1          0       0      0

Interesting. The “COUNTYENDN” variable is missing entirely, while the “F” variable is ~93% missing. A very small proportion of the latitude variables are also missing. Given that I’m only interested in the variables that have economic and health consequences, these missing values won’t be problematic.

Now, let’s take a subset of the data that only includes the variables that are of interest to me.

## Select Variables
stormData <- stormData %>% 
        select("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

Some of these variables are related. The value for “PROPDMG” doesn’t represent the amount of property damage alone, it represents the coefficient - so it needs to be multiplied by the corresponding value from “PROPDMGEXP”. The same logic follows for “CROPDMG” and “CROPDMGEXP”.

The code below creates two new variables, “PROPERTYDMGS” and “CROPDMGS” that multiply the values for PROPDMG and CROPDMG by their corresponding exponent. Observations that have missing exponent values or erroneous exponent values (a very small amount of the exponents are labeled “?”, for example) are assigned a value of NA.

The original “PROPDMG,”PROPDMGEXP“,”CROPDMG“, and”CROPDMPEXP" columns are then dropped.

## Property Damage
stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
stormData <- stormData %>% 
        mutate(PROPERTYDMGS = case_when(PROPDMGEXP == "K" ~ PROPDMG * 1000,
                                        PROPDMGEXP == "M" ~ PROPDMG * 1000000,
                                        PROPDMGEXP == "B" ~ PROPDMG * 1000000000))

## Crop Damage
stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
stormData <- stormData %>% 
        mutate(CROPDMGS = case_when(CROPDMGEXP == "K" ~ CROPDMG * 1000,
                                    CROPDMGEXP == "M" ~ CROPDMG * 1000000,
                                    CROPDMGEXP == "B" ~ CROPDMG * 1000000000))

## Drop old property damage and crop damage columns
stormData <- stormData %>% 
        select(-PROPDMG, -PROPDMGEXP, -CROPDMG, -CROPDMGEXP)

## Format new columns as integers
stormData$CROPDMGS <- as.integer(stormData$CROPDMGS)
stormData$PROPERTYDMGS <- as.integer(stormData$PROPERTYDMGS)

The date variable, BGN_DATE, isn’t formatted as a date. Let’s fix that, and then extract the year from it.

## Format as date
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")

## Create new variable, year
stormData <- stormData %>% 
        mutate(YEAR = str_extract(BGN_DATE, "^\\d{4}")) %>%
        select(-BGN_DATE)

## Format year variable
stormData$YEAR <- lubridate::ymd(stormData$YEAR, truncated = 2L)

The NOAA notes that prior to 1996, only four types of events were recorded - Hail, Wind, Thunderstorms, and Tornadoes. From 1996 onward, 48 event types are recorded. The data will be skewed towards the original four events if I use data from all time periods, so I’ll need to subset the data from 1996 onward to allow for proper comparison between event types.

## Filter on observations from 1996 onward
stormData <- stormData %>% 
        filter(YEAR >= "1996-01-01")

We only need events that resulted in either health or economic impacts, so we can filter out observations that don’t fall into either of those categories.

## Only keep events with health or economic impacts
stormData <- stormData %>% 
        filter(PROPERTYDMGS > 0 | CROPDMGS > 0 | FATALITIES > 0 | INJURIES > 0)

Let’s check the cleanliness of the event type data.

## Change case of all event types to upper
stormData$EVTYPE <- toupper(stormData$EVTYPE)
stormData %>% distinct(EVTYPE)

##                        EVTYPE
## 1                WINTER STORM
## 2                     TORNADO
## 3                   TSTM WIND
## 4                   HIGH WIND
## 5                 FLASH FLOOD
## 6               FREEZING RAIN
## 7                EXTREME COLD
## 8                   LIGHTNING
## 9                        HAIL
## 10                      FLOOD
## 11             TSTM WIND/HAIL
## 12             EXCESSIVE HEAT
## 13               RIP CURRENTS
## 14                      OTHER
## 15                 HEAVY SNOW
## 16           WILD/FOREST FIRE
## 17                  ICE STORM
## 18                   BLIZZARD
## 19                STORM SURGE
## 20       ICE JAM FLOOD (MINOR
## 21                 DUST STORM
## 22                STRONG WIND
## 23                 DUST DEVIL
## 24       URBAN/SML STREAM FLD
## 25                        FOG
## 26                 ROUGH SURF
## 27                 HEAVY SURF
## 28                 HEAVY RAIN
## 29            MARINE ACCIDENT
## 30                  AVALANCHE
## 31                     FREEZE
## 32             DRY MICROBURST
## 33                      WINDS
## 34              COASTAL STORM
## 35         EROSION/CSTL FLOOD
## 36             RIVER FLOODING
## 37                 WATERSPOUT
## 38            DAMAGING FREEZE
## 39                  HURRICANE
## 40             TROPICAL STORM
## 41              BEACH EROSION
## 42                  HIGH SURF
## 43       HEAVY RAIN/HIGH SURF
## 44          UNSEASONABLE COLD
## 45                EARLY FROST
## 46                 WINTRY MIX
## 47                    DROUGHT
## 48           COASTAL FLOODING
## 49        TORRENTIAL RAINFALL
## 50                  LANDSLUMP
## 51          HURRICANE EDOUARD
## 52             TIDAL FLOODING
## 53               STRONG WINDS
## 54          EXTREME WINDCHILL
## 55                      GLAZE
## 56              EXTENDED COLD
## 57                  WHIRLWIND
## 58          HEAVY SNOW SHOWER
## 59                 LIGHT SNOW
## 60              COASTAL FLOOD
## 61               MIXED PRECIP
## 62                       COLD
## 63             FREEZING SPRAY
## 64                  DOWNBURST
## 65                  MUDSLIDES
## 66                 MICROBURST
## 67                   MUDSLIDE
## 68                       SNOW
## 69               SNOW SQUALLS
## 70                WIND DAMAGE
## 71             LIGHT SNOWFALL
## 72           FREEZING DRIZZLE
## 73            GUSTY WIND/RAIN
## 74        GUSTY WIND/HVY RAIN
## 75                       WIND
## 76           COLD TEMPERATURE
## 77                  HEAT WAVE
## 78              COLD AND SNOW
## 79                  RAIN/SNOW
## 80            TSTM WIND (G45)
## 81                GUSTY WINDS
## 82                 GUSTY WIND
## 83               TSTM WIND 40
## 84               TSTM WIND 45
## 85                HARD FREEZE
## 86             TSTM WIND (41)
## 87                       HEAT
## 88                RIVER FLOOD
## 89            TSTM WIND (G40)
## 90                RIP CURRENT
## 91                  MUD SLIDE
## 92               FROST/FREEZE
## 93               SNOW AND ICE
## 94        AGRICULTURAL FREEZE
## 95             WINTER WEATHER
## 96                SNOW SQUALL
## 97                  ICY ROADS
## 98               THUNDERSTORM
## 99       HYPOTHERMIA/EXPOSURE
## 100          LAKE EFFECT SNOW
## 101       MIXED PRECIPITATION
## 102                 BLACK ICE
## 103              COASTALSTORM
## 104                 DAM BREAK
## 105              BLOWING SNOW
## 106                     FROST
## 107             GRADIENT WIND
## 108         UNSEASONABLY COLD
## 109   TSTM WIND AND LIGHTNING
## 110            WET MICROBURST
## 111       HEAVY SURF AND WIND
## 112              FUNNEL CLOUD
## 113                   TYPHOON
## 114                LANDSLIDES
## 115               HIGH SWELLS
## 116                HIGH WINDS
## 117                SMALL HAIL
## 118           UNSEASONAL RAIN
## 119  COASTAL FLOODING/EROSION
## 120           TSTM WIND (G45)
## 121          TSTM WIND  (G45)
## 122           HIGH WIND (G40)
## 123           TSTM WIND (G35)
## 124           COASTAL EROSION
## 125         UNSEASONABLY WARM
## 126                    SEICHE
## 127 COASTAL  FLOODING/EROSION
## 128     HYPERTHERMIA/EXPOSURE
## 129                ROCK SLIDE
## 130           GUSTY WIND/HAIL
## 131                HEAVY SEAS
## 132                 TSTM WIND
## 133                 LANDSPOUT
## 134               RECORD HEAT
## 135            EXCESSIVE SNOW
## 136         FLOOD/FLASH/FLOOD
## 137             WIND AND WAVE
## 138         FLASH FLOOD/FLOOD
## 139       LIGHT FREEZING RAIN
## 140                 ICE ROADS
## 141                 HIGH SEAS
## 142                      RAIN
## 143                ROUGH SEAS
## 144             TSTM WIND G45
## 145    NON-SEVERE WIND DAMAGE
## 146              WARM WEATHER
## 147   THUNDERSTORM WIND (G40)
## 148                 LANDSLIDE
## 149                HIGH WATER
## 150               FLASH FLOOD
## 151          LATE SEASON SNOW
## 152        WINTER WEATHER MIX
## 153                ROGUE WAVE
## 154          FALLING SNOW/ICE
## 155             NON-TSTM WIND
## 156             NON TSTM WIND
## 157                BRUSH FIRE
## 158              BLOWING DUST
## 159              VOLCANIC ASH
## 160        HIGH SURF ADVISORY
## 161            HAZARDOUS SURF
## 162                  WILDFIRE
## 163              COLD WEATHER
## 164               ICE ON ROAD
## 165                  DROWNING
## 166   EXTREME COLD/WIND CHILL
## 167          MARINE TSTM WIND
## 168         HURRICANE/TYPHOON
## 169                 DENSE FOG
## 170        WINTER WEATHER/MIX
## 171    ASTRONOMICAL HIGH TIDE
## 172      HEAVY SURF/HIGH SURF
## 173       TROPICAL DEPRESSION
## 174          LAKE-EFFECT SNOW
## 175          MARINE HIGH WIND
## 176         THUNDERSTORM WIND
## 177                   TSUNAMI
## 178          STORM SURGE/TIDE
## 179           COLD/WIND CHILL
## 180           LAKESHORE FLOOD
## 181  MARINE THUNDERSTORM WIND
## 182        MARINE STRONG WIND
## 183     ASTRONOMICAL LOW TIDE
## 184               DENSE SMOKE
## 185               MARINE HAIL
## 186              FREEZING FOG

Oof. There are 186 observations, when the NOAA says there should only be 48. This means there are a lot of duplicates in there.

Because I’m looking for the most harmful events - and, I suppose, to avoid large amounts of tedious work - I’m going to only clean observations that are in the 90th percentile as far as health and economic impacts go.

## Change NA's to 0's
stormData <- stormData %>% 
        mutate(CROPDMGS = replace_na(CROPDMGS, 0))
stormData <- stormData %>% 
        mutate(PROPERTYDMGS = replace_na(PROPERTYDMGS, 0))

## Create variables representing health and economic impacts
stormData <- stormData %>% 
        mutate(HEALTHIMPACT = INJURIES + FATALITIES, ECONIMPACT = PROPERTYDMGS + CROPDMGS)

## View observations in 90th percentile of health impacts
stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(HEALTHIMPACT = sum(HEALTHIMPACT)) %>% 
        arrange(desc(HEALTHIMPACT)) %>% 
        filter(HEALTHIMPACT > quantile(HEALTHIMPACT, 0.9))

## # A tibble: 19 x 2
##    EVTYPE            HEALTHIMPACT
##    <chr>                    <dbl>
##  1 TORNADO                  22178
##  2 EXCESSIVE HEAT            8188
##  3 FLOOD                     7172
##  4 LIGHTNING                 4792
##  5 TSTM WIND                 3870
##  6 FLASH FLOOD               2561
##  7 THUNDERSTORM WIND         1530
##  8 WINTER STORM              1483
##  9 HEAT                      1459
## 10 HURRICANE/TYPHOON         1339
## 11 HIGH WIND                 1318
## 12 WILDFIRE                   986
## 13 HEAVY SNOW                 805
## 14 FOG                        772
## 15 HAIL                       720
## 16 WILD/FOREST FIRE           557
## 17 RIP CURRENT                549
## 18 RIP CURRENTS               496
## 19 BLIZZARD                   455

## View observations in 90th percentile of economic impacts
stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(ECONIMPACT = sum(ECONIMPACT)) %>% 
        arrange(desc(ECONIMPACT)) %>% 
        filter(ECONIMPACT > quantile(ECONIMPACT, 0.9))

## # A tibble: 19 x 2
##    EVTYPE             ECONIMPACT
##    <chr>                   <dbl>
##  1 FLOOD             30919611943
##  2 TORNADO           22100370719
##  3 HAIL              17071172869
##  4 FLASH FLOOD       16557105603
##  5 DROUGHT           14413667000
##  6 HURRICANE         11554229010
##  7 HURRICANE/TYPHOON 11003712799
##  8 HIGH WIND          5881421657
##  9 WILDFIRE           5054139799
## 10 TSTM WIND          5031971788
## 11 THUNDERSTORM WIND  3780985440
## 12 ICE STORM          3657908808
## 13 TROPICAL STORM     3170186549
## 14 WILD/FOREST FIRE   3108564829
## 15 WINTER STORM       1544687250
## 16 EXTREME COLD       1328733399
## 17 HEAVY RAIN         1313034240
## 18 FROST/FREEZE       1104666000
## 19 LIGHTNING           749975520

## Clean 90th percentile observations
stormData <- stormData %>% 
        mutate_at("EVTYPE", str_replace, "TSTM WIND", "THUNDERSTORM WIND") %>% 
        mutate_at("EVTYPE", str_replace, "RIP CURRENTS", "RIP CURRENT") %>% 
        mutate_at("EVTYPE", str_replace, "FOG", "FREEZING FOG") %>% 
        mutate_at("EVTYPE", str_replace, "WILD/FOREST FIRE", "WILDFIRE") %>%
        mutate_at("EVTYPE", str_replace, "HURRICANE/TYPHOON", "HURRICANE (TYPHOON)") %>% 
        mutate_at("EVTYPE", str_replace, "EXTREME COLD/WIND CHILL", "EXTREME COLD") %>% 
        mutate_at("EVTYPE", str_replace, "HURRICANE$", "HURRICANE (TYPHOON)")

At this point, the data set is clean enough to conduct some analysis on.

Results

Let’s visualize which weather events had the biggest health impacts.

## Events with the largest total health costs
healthCosts <- stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(HEALTHIMPACT = sum(HEALTHIMPACT)) %>% 
        arrange(desc(HEALTHIMPACT)) %>%
        top_n(10) %>% 
        ggplot(aes(x = reorder(EVTYPE, HEALTHIMPACT), y = HEALTHIMPACT, color = EVTYPE)) +
        geom_bar(stat="identity") + 
        theme_bw() +
        theme(legend.position="none") +
        ggtitle("Health Costs (Injuries + Fatalities) by Event Type") +
        labs(x = "Event Type", y = "Total Health Impacts") +
        coord_flip()
        
## Events with the largest number of injuries
injuries <- stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(INJURIES = sum(INJURIES)) %>% 
        arrange(desc(INJURIES)) %>%
        top_n(10) %>% 
        ggplot(aes(x = reorder(EVTYPE, INJURIES), y = INJURIES, color = EVTYPE)) +
        geom_bar(stat="identity") + 
        theme_bw() +
        theme(legend.position="none") +
        ggtitle("Injuries by Event Type") +
        labs(x = "Event Type", y = "Total Injuries") +
        coord_flip()

## Events with the largest number of deaths
deaths <- stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(FATALITIES = sum(FATALITIES)) %>% 
        arrange(desc(FATALITIES)) %>%
        top_n(10) %>% 
        ggplot(aes(x = reorder(EVTYPE, FATALITIES), y = FATALITIES, color = EVTYPE)) +
        geom_bar(stat="identity") + 
        theme_bw() +
        theme(legend.position="none") +
        ggtitle("Fatalities by Event Type") +
        labs(x = "Event Type", y = "Total Deaths") +
        coord_flip()

## Plot
gridExtra::grid.arrange(healthCosts, injuries, deaths, heights = unit(c(2.5,2.5,2.5), c("in", "in", "in")), nrow = 3)

We observe that tornadoes lead to the most injuries, while excessive heat causes the most fatalities.

Now, let’s visualize which events had the biggest economic consequences.

## Events with the largest total economic costs
econData <- stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(ECONIMPACT = sum(ECONIMPACT)) %>% 
        arrange(desc(ECONIMPACT)) %>% 
        top_n(10) %>% 
        ggplot(aes(x = reorder(EVTYPE, ECONIMPACT), y = ECONIMPACT, color = EVTYPE)) +
        geom_bar(stat="identity") + 
        theme_bw() +
        theme(legend.position="none") +
        ggtitle("Economic Costs (Property + Crop Damage) by Event Type") +
        labs(x = "Event Type", y = "Total Economic Costs") + 
        coord_flip() +
        scale_y_continuous(labels=scales::dollar_format())

## Events with the largest total property damage
propertyDamage <- stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(PROPERTYDMGS = sum(PROPERTYDMGS)) %>% 
        arrange(desc(PROPERTYDMGS)) %>% 
        top_n(10) %>% 
        ggplot(aes(x = reorder(EVTYPE, PROPERTYDMGS), y = PROPERTYDMGS, color = EVTYPE)) +
        geom_bar(stat="identity") + 
        theme_bw() +
        theme(legend.position="none") +
        ggtitle("Property Damage by Event Type") +
        labs(x = "Event Type", y = "Total Property Damage") + 
        coord_flip() +
        scale_y_continuous(labels=scales::dollar_format())

cropDamage <- stormData %>% 
        group_by(EVTYPE) %>% 
        summarise(CROPDMGS = sum(CROPDMGS)) %>% 
        arrange(desc(CROPDMGS)) %>% 
        top_n(10) %>% 
        ggplot(aes(x = reorder(EVTYPE, CROPDMGS), y = CROPDMGS, color = EVTYPE)) +
        geom_bar(stat="identity") + 
        theme_bw() +
        theme(legend.position="none") +
        ggtitle("Crop Damage by Event Type") +
        labs(x = "Event Type", y = "Total Crop Damage") + 
        coord_flip() +
        scale_y_continuous(labels=scales::dollar_format())

## Plot
gridExtra::grid.arrange(econData, propertyDamage, cropDamage, heights = unit(c(2.5,2.5,2.5), c("in", "in", "in")), nrow = 3)

We observe that floods cause the most property damage, while droughts cause the most crop damage. This makes intuitive sense - floods are somewhat common in homes, and crops often die in a drought.

Severe Weather Analysis

Kevin Roche

17/08/2021

Synopsis

Setup

Data Processing

Results