Impact of Severe Weather Events on Population Health and Economy

Synopsis

Severe weather events have a substantial impact on communities across the United States. Each year, severe weather causes billions of dollars in damages to property and crops, and hundreds of lives are lost, in addition to thousands of injuries. The purpose of this analysis is to identify the types of weather events that cause the greatest damage to the population and to the economy.

The data used in this analysis is from the US National Oceanic and Atmospheric Administration (NOAA) database. The data can be downloaded here.

For more information on this data set, refer to the National Weather Service Storm Data Documentation.

Data Processing

First let’s load the libraries we will need.

library(R.utils)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(knitr)
library(lubridate)

Next, we download the data.

If the data are not already in the working directory, we will need to download the data and unzip the file.

if(!file.exists("stormdata.csv")) {
 fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
 download.file(fileUrl, "stormdata.csv.bz2")
 bunzip2("stormdata.csv.bz2") }

Now we can read the data into R with read_csv.

storms <- read.csv("stormdata.csv")

We will start by taking a look at the storms data frame and get a feel for what it contains. Let’s look at the dimensions.

dim(storms)

## [1] 902297     37

We see that storms contains 902297 observations of 37 variables.

Next, let’s look at the column names.

colnames(storms)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The data frame contains information on many variables that will not be relevant for this analysis. Since we will be focusing on the impact of different types of natural disasters on population health and on the economy, let’s select only the variables that are relevant to that. I will keep the ‘BGN_DATE’ variable as well so that we can determine if there are any years that should be excluded. Let’s also make the variable names a little more readable.

storm_data <- storms %>%
  select(date = BGN_DATE, event_type = EVTYPE, fatalities = FATALITIES, injuries = INJURIES, 
         property_damage = PROPDMG, property_damage_exponent = PROPDMGEXP, 
         crop_damage = CROPDMG, crop_damage_exponent = CROPDMGEXP)

Let’s take a look at the structure of our new data frame.

str(storm_data)

## 'data.frame':    902297 obs. of  8 variables:
##  $ date                    : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ event_type              : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ fatalities              : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries                : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ property_damage         : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ property_damage_exponent: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ crop_damage             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ crop_damage_exponent    : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

There are 985 unique values in the event type column. That’s a lot! I bet some of these are duplicates and should be grouped together. We will come back to that later.

The date information is stored as a factor. Let’s parse it and extract the year.

storm_data$date <- mdy_hms(storm_data$date)

storm_data <- storm_data %>%
  mutate(year = year(date)) %>%
  select(-date)

Now that we have added a year column, we can look at the distribution of the number of records created over time. The assignment has stated that the data collected in the earlier years is less complete. Making a histogram can show us which years we want to include in our analysis.

storm_data %>%
  ggplot(aes(x = year, fill = (year >= 1995))) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(1950, 2010, 5)) +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  labs(title = "Histogram of records per year")

Sure enough, there is a big increase around 1995 in the number of events recorded. Let’s get rid of the earlier data so it doesn’t skew our analysis.

storm_data <- storm_data %>%
  filter(year >= 1995)

Let’s call str again and see where we are now.

str(storm_data)

## 'data.frame':    681500 obs. of  8 variables:
##  $ event_type              : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 201 629 657 657 409 244 786 786 244 786 ...
##  $ fatalities              : num  0 0 0 0 2 0 0 0 0 0 ...
##  $ injuries                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ property_damage         : num  0 0 0 0 0.1 0 0 0 0 0 ...
##  $ property_damage_exponent: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 14 1 1 1 1 1 ...
##  $ crop_damage             : num  0 0 0 0 10 0 0 0 0 0 ...
##  $ crop_damage_exponent    : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 9 1 1 1 1 1 ...
##  $ year                    : num  1995 1995 1995 1995 1995 ...

fatalities and injuries are numerics representing how many people perished or sustained an injury during each event. That part is pretty straightforward. No processing needed here.

The columns representing damages to property and crops will need some processing. According to the following excerpt from the documentation provided by NOAA, we would excpect the property_damage_exponent and crop_damage_exponent columns to contain “K”, “M”, and “B” as factor levels.

Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.

However, the call to str shows that these columns also contain numbers and characters such as “?” and “+”. We can use table to find out how many of the observations containing these characters actually have non-zero property damages . While we’re at it, let’s fix the inconsistencies in case using toupper.

storm_data$crop_damage_exponent <- toupper(storm_data$crop_damage_exponent)
storm_data$property_damage_exponent <- toupper(storm_data$property_damage_exponent)

table(storm_data$crop_damage_exponent, storm_data$crop_damage > 0)

##    
##      FALSE   TRUE
##     400088      0
##   ?      5      0
##   0      4      4
##   2      1      0
##   B      2      5
##   K 261693  17793
##   M     76   1829

table(storm_data$property_damage_exponent, storm_data$property_damage > 0)

##    
##      FALSE   TRUE
##     294449     67
##   -      0      1
##   ?      5      0
##   +      0      4
##   0      4    183
##   1     25      0
##   2     12      1
##   3      3      1
##   4      0      3
##   5      9     17
##   6      0      3
##   7      3      2
##   8      1      0
##   B      0     37
##   H      0      6
##   K 188071 190635
##   M     11   7947

Ok. It looks like the vast majority of records with non-zero damages are marked with “K”, “M”, or “B”. (There are also 6 records with an “H” in the property damages table - we will assume that stands for “hundreds” and not exclude it.) We will consider the damage value for any records that do not adhere to this system to be 0.

Let’s calculate new columns property_damage_dollars and crop_damage_dollars that multiply the values in the numeric damages columns by the appropriate exponent.

storm_data <- storm_data %>%
  mutate(property_multiplier = case_when(
    property_damage_exponent == "H" ~ 10^2,
    property_damage_exponent == "K" ~ 10^3,
    property_damage_exponent == "M" ~ 10^6,
    property_damage_exponent == "B" ~ 10^9,
    !property_damage_exponent %in% c("H", "K", "M", "B") ~ 0 ))
    
storm_data <- storm_data %>%
  mutate(crop_multiplier = case_when(
    crop_damage_exponent == "K" ~ 10^3,
    crop_damage_exponent == "M" ~ 10^6,
    crop_damage_exponent == "B" ~ 10^9,
    !crop_damage_exponent %in% c("K", "M", "B") ~ 0 ))
   
storm_data$property_damage_dollars <- rep(0, nrow(storm_data))
storm_data$crop_damage_dollars <- rep(0, nrow(storm_data))
storm_data <- storm_data %>%
  mutate(property_damage_dollars = property_damage * property_multiplier,
         crop_damage_dollars = crop_damage * crop_multiplier)

Now that we have converted the damage values to dollar amounts, let’s go back and look at how the different event types are recorded.

head(unique(storm_data$event_type), 50)

##  [1] FREEZING RAIN                  SNOW                          
##  [3] SNOW/ICE                       HURRICANE OPAL/HIGH WINDS     
##  [5] HAIL                           THUNDERSTORM WINDS            
##  [7] RECORD COLD                    HURRICANE ERIN                
##  [9] HURRICANE OPAL                 DENSE FOG                     
## [11] RIP CURRENT                    TORNADO                       
## [13] THUNDERSTORM WINS              LIGHTNING                     
## [15] FLASH FLOOD                    FLASH FLOODING                
## [17] HIGH WINDS                     TORNADO F0                    
## [19] THUNDERSTORM WINDS LIGHTNING   FUNNEL CLOUD                  
## [21] THUNDERSTORM WINDS/HAIL        THUNDERSTORM WIND             
## [23] HEAT                           WIND                          
## [25] HEAVY RAINS                    LIGHTNING AND HEAVY RAIN      
## [27] HEAVY RAIN                     THUNDERSTORM WINDS HAIL       
## [29] FLOOD                          COLD                          
## [31] WALL CLOUD/FUNNEL CLOUD        THUNDERSTORM                  
## [33] WATERSPOUT                     EXTREME COLD                  
## [35] BLIZZARD WEATHER               BLIZZARD                      
## [37] WIND CHILL                     BREAKUP FLOODING              
## [39] RECORD HIGH TEMPERATURE        RECORD HIGH                   
## [41] HIGH WIND                      COASTAL FLOOD                 
## [43] RIVER FLOOD                    HIGH WINDS HEAVY RAINS        
## [45] HIGH WIND/LOW WIND CHILL       HEAVY SNOW                    
## [47] HEAVY SNOW/HIGH                RECORD LOW                    
## [49] HIGH WINDS AND WIND CHILL      HEAVY SNOW/HIGH WINDS/FREEZING
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

It looks like there is a lot of inconsistency in how the event type is recorded in the data. Just from this brief glance at the different event types, we can see that some entries contain misspellings or abbreviations, some have extra details, and some have characters such as “/”. We will need to do some more work to group the different types into classes before moving forward with the analysis.

Because we are only concerned with event types that cause the most damage, we don’t need to go through each event type individually and assign it to a class. We can limit ourselves to the event types that matter most. To figure out what those are, we can make a table showing the event types for events with > 50 fatalities. I will also convert all event types to lower case to facilitate this process.

storm_data$event_type <- tolower(storm_data$event_type)

event_types <- storm_data %>%
  group_by(event_type) %>%
  summarize(n = n(), deaths = sum(fatalities)) %>%
  arrange(desc(deaths)) %>%
  filter(deaths > 50) %>%
  ungroup()

kable(as.data.frame(event_types))

event_type	n	deaths
excessive heat	1675	1903
tornado	24335	1545
flash flood	52673	934
heat	755	924
lightning	14280	729
flood	24642	423
rip current	459	360
high wind	19958	241
tstm wind	128925	241
avalanche	380	223
rip currents	304	204
winter storm	11372	195
heat wave	69	161
thunderstorm wind	81746	131
extreme cold	634	128
extreme cold/wind chill	1002	125
heavy snow	14710	115
strong wind	3565	103
high surf	730	102
cold/wind chill	539	95
heavy rain	11640	95
extreme heat	17	91
ice storm	1928	84
wildfire	2757	75
blizzard	2656	71
hurricane/typhoon	88	64
fog	535	61
hurricane	170	61
tropical storm	684	57

“Excessive heat”, “heat wave”, “heat”, and “extreme heat” appear on this list separately. We can create a class “extreme heat” that groups them together. Let’s look at all the event types that pertain to heat.

storm_data$event_type <- tolower(storm_data$event_type)
unique(grep("heat", storm_data$event_type, value = TRUE))

## [1] "heat"                   "extreme heat"          
## [3] "excessive heat"         "record heat"           
## [5] "heat wave"              "drought/excessive heat"
## [7] "heat wave drought"      "heatburst"             
## [9] "excessive heat/drought"

Three of these contain “drought”. “Drought” is an important classification when it comes to crop damage, so we don’t want to lose that information by assigning them to “heat”. These will be reclassified later. For now, assign anything that has “heat” in the event type name to the “extreme heat” class.

storm_data$event_class <- storm_data$event_type
storm_data$event_class[grepl("heat", storm_data$event_type)] <- "extreme heat"

We can do the same thing for cold that we did for heat.

unique(grep("cold", storm_data$event_type, value = TRUE))

##  [1] "record cold"                  "cold"                        
##  [3] "extreme cold"                 "unseasonably cold"           
##  [5] "extreme/record cold"          "cold wave"                   
##  [7] "cold and wet conditions"      "cold air funnels"            
##  [9] "cold air funnel"              "prolong cold"                
## [11] "record cold/frost"            "record snow/cold"            
## [13] "cold weather"                 "prolong cold/snow"           
## [15] "unseasonable cold"            "excessive cold"              
## [17] "extended cold"                "cold temperature"            
## [19] "cold and snow"                "cold and frost"              
## [21] "cold temperatures"            "cold wind chill temperatures"
## [23] "record  cold"                 "unusually cold"              
## [25] "extreme cold/wind chill"      "cold/wind chill"

All of these event types can be grouped into an “extreme cold” category.

storm_data$event_class[grepl("cold", storm_data$event_type)] <- "extreme cold"

The “tornado” event type has the second most fatalities according to our table. “Tornado” is coded in several different ways, as was heat, so let’s create an event class for “tornado”.

storm_data$event_class[grepl("torn", storm_data$event_type)] <- "tornado"

I will do the same for a few of the other event types. I’ll spare you the gory details.

storm_data$event_class[grepl("flood|fld", storm_data$event_type)] <- "flood"
storm_data$event_class[grepl("fire", storm_data$event_type)] <- "fire"
storm_data$event_class[grepl("aval", storm_data$event_type)] <- "avalanche"
storm_data$event_class[grepl("rip", storm_data$event_type)] <- "rip current"
storm_data$event_class[grepl("thunder|tstm", storm_data$event_type)] <- "thunderstorm winds"
storm_data$event_class[grepl("hurricane", storm_data$event_type)] <- "hurricane"
storm_data$event_class[grepl("tropical", storm_data$event_type)] <- "tropical storm"
storm_data$event_class[grepl("wint|blizz", storm_data$event_type)] <- "winter storm"
storm_data$event_class[grepl("strong wind|high wind|gusty wind", storm_data$event_type)] <- "strong wind"
storm_data$event_class[grepl("ice", storm_data$event_type)] <- "ice"
storm_data$event_class[grepl("surge", storm_data$event_type)] <- "storm surge"

Let’s check our progress by making a table of the event classes including how many records are in each, and how many fatalities are attributed to each.

event_classes <- storm_data %>%
  group_by(event_class) %>%
  summarize(n = n(), deaths = sum(fatalities)) %>%
  arrange(desc(deaths)) %>%
  filter(deaths > 50) %>%
  ungroup()

kable(as.data.frame(event_classes))

event_class	n	deaths
extreme heat	2602	3087
tornado	24373	1545
flood	82613	1414
lightning	14280	729
rip current	766	569
thunderstorm winds	234076	440
extreme cold	2373	403
strong wind	24797	386
winter storm	22237	328
avalanche	380	223
hurricane	281	133
heavy snow	14710	115
high surf	730	102
heavy rain	11640	95
ice	2033	93
fire	4216	87
fog	535	61
tropical storm	749	57

We could keep classifying, but what we’ve done so far is enough to capture the top 10 event types in terms of impact on human population.

Since the types of events that affect crops are different than those that affect population and property, let’s investigate whether we need to do any more classification to capture crop damage.

crop_by_type <- storm_data %>%
  group_by(event_type) %>%
  summarize(n = n(), total_crop_damage= sum(crop_damage_dollars)) %>%
  arrange(desc(total_crop_damage)) %>%
  filter(total_crop_damage > 10000000) %>%
  ungroup()

kable(as.data.frame(crop_by_type))

event_type	n	total_crop_damage
drought	2471	13922066000
flood	24642	5422810400
hurricane	170	2741410000
hail	215932	2614127050
hurricane/typhoon	88	2607872800
flash flood	52673	1343915000
extreme cold	634	1312473000
frost/freeze	1343	1094186000
heavy rain	11640	728399800
tropical storm	684	677836000
high wind	19958	633561300
tstm wind	128925	553947350
excessive heat	1675	492402000
thunderstorm wind	81746	414354000
freeze	74	406725000
heat	755	401411500
tornado	24335	296595610
damaging freeze	8	296230000
wildfire	2757	295472800
excessive wetness	1	142000000
hurricane erin	7	136010000
heavy snow	14710	129153100
flood/rain/winds	6	112800000
wild/forest fire	1446	106782830
flood/flash flood	258	94979000
cold and wet conditions	1	66000000
strong wind	3565	64953500
tstm wind/hail	1028	64696250
thunderstorm winds	10041	60172900
early frost	1	42000000
high winds	776	34170000
severe thunderstorm winds	5	29000000
agricultural freeze	6	28820000
river flooding	18	28020000
unseasonably cold	23	25042500
small hail	52	20793000
landslide	599	20017000
hurricane opal	9	19000000
river flood	148	18909000
extreme windchill	204	17000000
severe thunderstorms	20	17000000
tropical storm jerry	3	16000000
ice storm	1928	15660000
winter weather	6992	15000000
hard freeze	7	13100000
winter storm	11372	11944000
lightning	14280	10766040
flash flooding	295	10075000

The processing we have already done will help clean this up, but there are a few event types that we didn’t consider in our analysis of the fatality events. Let’s add classes for “drought”, “frost/freeze”, “hail”, and “heavy rain”.

storm_data$event_class[grepl("drought", storm_data$event_type)] <- "drought"
storm_data$event_class[grepl("freeze|frost", storm_data$event_type)] <- "frost/freeze"
storm_data$event_class[grepl("hail", storm_data$event_type)] <- "hail"
storm_data$event_class[grepl("heavy rain|excessive wetness", storm_data$event_type)] <- "heavy rain"

Let’s see how the classification cleans up this table.

crop_by_type <- storm_data %>%
  group_by(event_class) %>%
  summarize(n = n(), total_crop_damage= sum(crop_damage_dollars)) %>%
  arrange(desc(total_crop_damage)) %>%
  filter(total_crop_damage > 5000000) %>%
  ungroup()

kable(head(as.data.frame(crop_by_type), 10))

event_class	n	total_crop_damage
drought	2493	13922121780
flood	82604	7040614500
hurricane	281	5504792800
hail	217619	2699785300
frost/freeze	1502	1886061000
extreme cold	2364	1409265500
thunderstorm winds	232977	1075788250
extreme heat	2587	899313500
heavy rain	11681	871909800
strong wind	24795	747914800

cat('\n')

We could keep going but this is enough classification to proceed with the analysis and get some results.

Results

population_impact <- storm_data %>%
  group_by(event_class)%>%
  summarize(total_deaths = sum(fatalities),
            total_injuries = sum(injuries)) 

injuries <- population_impact %>%
  arrange(total_injuries) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_injuries))

fatalities <- population_impact %>%
  arrange(total_deaths) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_deaths))

p1 <- fatalities[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_deaths)) +
  geom_col(fill = 'salmon') +
  coord_flip() +
  labs(x = "", y = "Total Fatalities")

p2 <- injuries[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_injuries)) +
  geom_col(fill = 'salmon') +
  coord_flip() +
    labs(x = "", y = "Total Injuries")

grid.arrange(p1, p2, nrow = 1, top = "Weather related fatalities and injuries in the US, 1995-2011")

Extreme heat, tornados, and flooding have caused the most population damage of all severe weather types from 1995 - 2011 in the US. Extreme heat has led to more fatalities than tornadoes, although tornadoes have caused more injuries.

 damages <- storm_data %>%
   group_by(event_class) %>%
   summarize(total_property_damage = sum(property_damage_dollars),
             total_crop_damage = sum(crop_damage_dollars)) %>%
   ungroup()
 
 property <- damages %>%
  arrange(total_property_damage) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_property_damage))
 
  crops<- damages %>%
  arrange(total_crop_damage) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_crop_damage))

p3 <- property[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_property_damage/10^9)) +
  geom_col(fill = 'darkseagreen') +
  coord_flip() +
  labs(x = "", y = "Total property damage\n (in billions of US dollars)")

p4 <- crops[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_crop_damage/10^9)) +
  geom_col(fill = 'darkseagreen') +
  coord_flip() +
  labs(x = "", y = "Total crop damage\n (in billions of US dollars)")

grid.arrange(p3, p4, nrow = 1, top = "Weather related damages to property and crops in the US, 1995-2011")

Not surprisingly, flooding, hurricanes, and storm surges have caused more property damage than any other types of severe weather. Tornadoes, hail, and strong winds also have a significant economic impact when it comes to property.

Drought has caused the highest damages to crops. Flooding, hurricanes, hail, and freezing weather are also leading contributors to crop damages.