Synopsis

Severe weather events have a substantial impact on communities across the United States. Each year, severe weather causes billions of dollars in damages to property and crops, and hundreds of lives are lost, in addition to thousands of injuries. The purpose of this analysis is to identify the types of weather events that cause the greatest damage to the population and to the economy.

The data used in this analysis is from the US National Oceanic and Atmospheric Administration (NOAA) database. The data can be downloaded here.

For more information on this data set, refer to the National Weather Service Storm Data Documentation.

Data Processing

First let’s load the libraries we will need.

library(R.utils)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(knitr)
library(lubridate)

Next, we download the data.

If the data are not already in the working directory, we will need to download the data and unzip the file.

if(!file.exists("stormdata.csv")) {
 fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
 download.file(fileUrl, "stormdata.csv.bz2")
 bunzip2("stormdata.csv.bz2") }

Now we can read the data into R with read_csv.

storms <- read.csv("stormdata.csv")

We will start by taking a look at the storms data frame and get a feel for what it contains. Let’s look at the dimensions.

dim(storms)
## [1] 902297     37

We see that storms contains 902297 observations of 37 variables.

Next, let’s look at the column names.

colnames(storms)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The data frame contains information on many variables that will not be relevant for this analysis. Since we will be focusing on the impact of different types of natural disasters on population health and on the economy, let’s select only the variables that are relevant to that. I will keep the ‘BGN_DATE’ variable as well so that we can determine if there are any years that should be excluded. Let’s also make the variable names a little more readable.

storm_data <- storms %>%
  select(date = BGN_DATE, event_type = EVTYPE, fatalities = FATALITIES, injuries = INJURIES, 
         property_damage = PROPDMG, property_damage_exponent = PROPDMGEXP, 
         crop_damage = CROPDMG, crop_damage_exponent = CROPDMGEXP)

Let’s take a look at the structure of our new data frame.

str(storm_data)
## 'data.frame':    902297 obs. of  8 variables:
##  $ date                    : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ event_type              : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ fatalities              : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries                : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ property_damage         : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ property_damage_exponent: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ crop_damage             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ crop_damage_exponent    : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

There are 985 unique values in the event type column. That’s a lot! I bet some of these are duplicates and should be grouped together. We will come back to that later.

The date information is stored as a factor. Let’s parse it and extract the year.

storm_data$date <- mdy_hms(storm_data$date)

storm_data <- storm_data %>%
  mutate(year = year(date)) %>%
  select(-date)

Now that we have added a year column, we can look at the distribution of the number of records created over time. The assignment has stated that the data collected in the earlier years is less complete. Making a histogram can show us which years we want to include in our analysis.

storm_data %>%
  ggplot(aes(x = year, fill = (year >= 1995))) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(1950, 2010, 5)) +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  labs(title = "Histogram of records per year")

Sure enough, there is a big increase around 1995 in the number of events recorded. Let’s get rid of the earlier data so it doesn’t skew our analysis.

storm_data <- storm_data %>%
  filter(year >= 1995)

Let’s call str again and see where we are now.

str(storm_data)
## 'data.frame':    681500 obs. of  8 variables:
##  $ event_type              : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 201 629 657 657 409 244 786 786 244 786 ...
##  $ fatalities              : num  0 0 0 0 2 0 0 0 0 0 ...
##  $ injuries                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ property_damage         : num  0 0 0 0 0.1 0 0 0 0 0 ...
##  $ property_damage_exponent: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 14 1 1 1 1 1 ...
##  $ crop_damage             : num  0 0 0 0 10 0 0 0 0 0 ...
##  $ crop_damage_exponent    : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 9 1 1 1 1 1 ...
##  $ year                    : num  1995 1995 1995 1995 1995 ...

fatalities and injuries are numerics representing how many people perished or sustained an injury during each event. That part is pretty straightforward. No processing needed here.

The columns representing damages to property and crops will need some processing. According to the following excerpt from the documentation provided by NOAA, we would excpect the property_damage_exponent and crop_damage_exponent columns to contain “K”, “M”, and “B” as factor levels.

Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.

However, the call to str shows that these columns also contain numbers and characters such as “?” and “+”. We can use table to find out how many of the observations containing these characters actually have non-zero property damages . While we’re at it, let’s fix the inconsistencies in case using toupper.

storm_data$crop_damage_exponent <- toupper(storm_data$crop_damage_exponent)
storm_data$property_damage_exponent <- toupper(storm_data$property_damage_exponent)

table(storm_data$crop_damage_exponent, storm_data$crop_damage > 0)
##    
##      FALSE   TRUE
##     400088      0
##   ?      5      0
##   0      4      4
##   2      1      0
##   B      2      5
##   K 261693  17793
##   M     76   1829
table(storm_data$property_damage_exponent, storm_data$property_damage > 0)
##    
##      FALSE   TRUE
##     294449     67
##   -      0      1
##   ?      5      0
##   +      0      4
##   0      4    183
##   1     25      0
##   2     12      1
##   3      3      1
##   4      0      3
##   5      9     17
##   6      0      3
##   7      3      2
##   8      1      0
##   B      0     37
##   H      0      6
##   K 188071 190635
##   M     11   7947

Ok. It looks like the vast majority of records with non-zero damages are marked with “K”, “M”, or “B”. (There are also 6 records with an “H” in the property damages table - we will assume that stands for “hundreds” and not exclude it.) We will consider the damage value for any records that do not adhere to this system to be 0.

Let’s calculate new columns property_damage_dollars and crop_damage_dollars that multiply the values in the numeric damages columns by the appropriate exponent.

storm_data <- storm_data %>%
  mutate(property_multiplier = case_when(
    property_damage_exponent == "H" ~ 10^2,
    property_damage_exponent == "K" ~ 10^3,
    property_damage_exponent == "M" ~ 10^6,
    property_damage_exponent == "B" ~ 10^9,
    !property_damage_exponent %in% c("H", "K", "M", "B") ~ 0 ))
    
storm_data <- storm_data %>%
  mutate(crop_multiplier = case_when(
    crop_damage_exponent == "K" ~ 10^3,
    crop_damage_exponent == "M" ~ 10^6,
    crop_damage_exponent == "B" ~ 10^9,
    !crop_damage_exponent %in% c("K", "M", "B") ~ 0 ))
   
storm_data$property_damage_dollars <- rep(0, nrow(storm_data))
storm_data$crop_damage_dollars <- rep(0, nrow(storm_data))
storm_data <- storm_data %>%
  mutate(property_damage_dollars = property_damage * property_multiplier,
         crop_damage_dollars = crop_damage * crop_multiplier)

Now that we have converted the damage values to dollar amounts, let’s go back and look at how the different event types are recorded.

head(unique(storm_data$event_type), 50)
##  [1] FREEZING RAIN                  SNOW                          
##  [3] SNOW/ICE                       HURRICANE OPAL/HIGH WINDS     
##  [5] HAIL                           THUNDERSTORM WINDS            
##  [7] RECORD COLD                    HURRICANE ERIN                
##  [9] HURRICANE OPAL                 DENSE FOG                     
## [11] RIP CURRENT                    TORNADO                       
## [13] THUNDERSTORM WINS              LIGHTNING                     
## [15] FLASH FLOOD                    FLASH FLOODING                
## [17] HIGH WINDS                     TORNADO F0                    
## [19] THUNDERSTORM WINDS LIGHTNING   FUNNEL CLOUD                  
## [21] THUNDERSTORM WINDS/HAIL        THUNDERSTORM WIND             
## [23] HEAT                           WIND                          
## [25] HEAVY RAINS                    LIGHTNING AND HEAVY RAIN      
## [27] HEAVY RAIN                     THUNDERSTORM WINDS HAIL       
## [29] FLOOD                          COLD                          
## [31] WALL CLOUD/FUNNEL CLOUD        THUNDERSTORM                  
## [33] WATERSPOUT                     EXTREME COLD                  
## [35] BLIZZARD WEATHER               BLIZZARD                      
## [37] WIND CHILL                     BREAKUP FLOODING              
## [39] RECORD HIGH TEMPERATURE        RECORD HIGH                   
## [41] HIGH WIND                      COASTAL FLOOD                 
## [43] RIVER FLOOD                    HIGH WINDS HEAVY RAINS        
## [45] HIGH WIND/LOW WIND CHILL       HEAVY SNOW                    
## [47] HEAVY SNOW/HIGH                RECORD LOW                    
## [49] HIGH WINDS AND WIND CHILL      HEAVY SNOW/HIGH WINDS/FREEZING
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

It looks like there is a lot of inconsistency in how the event type is recorded in the data. Just from this brief glance at the different event types, we can see that some entries contain misspellings or abbreviations, some have extra details, and some have characters such as “/”. We will need to do some more work to group the different types into classes before moving forward with the analysis.

Because we are only concerned with event types that cause the most damage, we don’t need to go through each event type individually and assign it to a class. We can limit ourselves to the event types that matter most. To figure out what those are, we can make a table showing the event types for events with > 50 fatalities. I will also convert all event types to lower case to facilitate this process.

storm_data$event_type <- tolower(storm_data$event_type)

event_types <- storm_data %>%
  group_by(event_type) %>%
  summarize(n = n(), deaths = sum(fatalities)) %>%
  arrange(desc(deaths)) %>%
  filter(deaths > 50) %>%
  ungroup()

kable(as.data.frame(event_types))
event_type n deaths
excessive heat 1675 1903
tornado 24335 1545
flash flood 52673 934
heat 755 924
lightning 14280 729
flood 24642 423
rip current 459 360
high wind 19958 241
tstm wind 128925 241
avalanche 380 223
rip currents 304 204
winter storm 11372 195
heat wave 69 161
thunderstorm wind 81746 131
extreme cold 634 128
extreme cold/wind chill 1002 125
heavy snow 14710 115
strong wind 3565 103
high surf 730 102
cold/wind chill 539 95
heavy rain 11640 95
extreme heat 17 91
ice storm 1928 84
wildfire 2757 75
blizzard 2656 71
hurricane/typhoon 88 64
fog 535 61
hurricane 170 61
tropical storm 684 57

“Excessive heat”, “heat wave”, “heat”, and “extreme heat” appear on this list separately. We can create a class “extreme heat” that groups them together. Let’s look at all the event types that pertain to heat.

storm_data$event_type <- tolower(storm_data$event_type)
unique(grep("heat", storm_data$event_type, value = TRUE))
## [1] "heat"                   "extreme heat"          
## [3] "excessive heat"         "record heat"           
## [5] "heat wave"              "drought/excessive heat"
## [7] "heat wave drought"      "heatburst"             
## [9] "excessive heat/drought"

Three of these contain “drought”. “Drought” is an important classification when it comes to crop damage, so we don’t want to lose that information by assigning them to “heat”. These will be reclassified later. For now, assign anything that has “heat” in the event type name to the “extreme heat” class.

storm_data$event_class <- storm_data$event_type
storm_data$event_class[grepl("heat", storm_data$event_type)] <- "extreme heat"

We can do the same thing for cold that we did for heat.

unique(grep("cold", storm_data$event_type, value = TRUE))
##  [1] "record cold"                  "cold"                        
##  [3] "extreme cold"                 "unseasonably cold"           
##  [5] "extreme/record cold"          "cold wave"                   
##  [7] "cold and wet conditions"      "cold air funnels"            
##  [9] "cold air funnel"              "prolong cold"                
## [11] "record cold/frost"            "record snow/cold"            
## [13] "cold weather"                 "prolong cold/snow"           
## [15] "unseasonable cold"            "excessive cold"              
## [17] "extended cold"                "cold temperature"            
## [19] "cold and snow"                "cold and frost"              
## [21] "cold temperatures"            "cold wind chill temperatures"
## [23] "record  cold"                 "unusually cold"              
## [25] "extreme cold/wind chill"      "cold/wind chill"

All of these event types can be grouped into an “extreme cold” category.

storm_data$event_class[grepl("cold", storm_data$event_type)] <- "extreme cold"

The “tornado” event type has the second most fatalities according to our table. “Tornado” is coded in several different ways, as was heat, so let’s create an event class for “tornado”.

storm_data$event_class[grepl("torn", storm_data$event_type)] <- "tornado"

I will do the same for a few of the other event types. I’ll spare you the gory details.

storm_data$event_class[grepl("flood|fld", storm_data$event_type)] <- "flood"
storm_data$event_class[grepl("fire", storm_data$event_type)] <- "fire"
storm_data$event_class[grepl("aval", storm_data$event_type)] <- "avalanche"
storm_data$event_class[grepl("rip", storm_data$event_type)] <- "rip current"
storm_data$event_class[grepl("thunder|tstm", storm_data$event_type)] <- "thunderstorm winds"
storm_data$event_class[grepl("hurricane", storm_data$event_type)] <- "hurricane"
storm_data$event_class[grepl("tropical", storm_data$event_type)] <- "tropical storm"
storm_data$event_class[grepl("wint|blizz", storm_data$event_type)] <- "winter storm"
storm_data$event_class[grepl("strong wind|high wind|gusty wind", storm_data$event_type)] <- "strong wind"
storm_data$event_class[grepl("ice", storm_data$event_type)] <- "ice"
storm_data$event_class[grepl("surge", storm_data$event_type)] <- "storm surge"

Let’s check our progress by making a table of the event classes including how many records are in each, and how many fatalities are attributed to each.

event_classes <- storm_data %>%
  group_by(event_class) %>%
  summarize(n = n(), deaths = sum(fatalities)) %>%
  arrange(desc(deaths)) %>%
  filter(deaths > 50) %>%
  ungroup()

kable(as.data.frame(event_classes))
event_class n deaths
extreme heat 2602 3087
tornado 24373 1545
flood 82613 1414
lightning 14280 729
rip current 766 569
thunderstorm winds 234076 440
extreme cold 2373 403
strong wind 24797 386
winter storm 22237 328
avalanche 380 223
hurricane 281 133
heavy snow 14710 115
high surf 730 102
heavy rain 11640 95
ice 2033 93
fire 4216 87
fog 535 61
tropical storm 749 57

We could keep classifying, but what we’ve done so far is enough to capture the top 10 event types in terms of impact on human population.

Since the types of events that affect crops are different than those that affect population and property, let’s investigate whether we need to do any more classification to capture crop damage.

crop_by_type <- storm_data %>%
  group_by(event_type) %>%
  summarize(n = n(), total_crop_damage= sum(crop_damage_dollars)) %>%
  arrange(desc(total_crop_damage)) %>%
  filter(total_crop_damage > 10000000) %>%
  ungroup()

kable(as.data.frame(crop_by_type))
event_type n total_crop_damage
drought 2471 13922066000
flood 24642 5422810400
hurricane 170 2741410000
hail 215932 2614127050
hurricane/typhoon 88 2607872800
flash flood 52673 1343915000
extreme cold 634 1312473000
frost/freeze 1343 1094186000
heavy rain 11640 728399800
tropical storm 684 677836000
high wind 19958 633561300
tstm wind 128925 553947350
excessive heat 1675 492402000
thunderstorm wind 81746 414354000
freeze 74 406725000
heat 755 401411500
tornado 24335 296595610
damaging freeze 8 296230000
wildfire 2757 295472800
excessive wetness 1 142000000
hurricane erin 7 136010000
heavy snow 14710 129153100
flood/rain/winds 6 112800000
wild/forest fire 1446 106782830
flood/flash flood 258 94979000
cold and wet conditions 1 66000000
strong wind 3565 64953500
tstm wind/hail 1028 64696250
thunderstorm winds 10041 60172900
early frost 1 42000000
high winds 776 34170000
severe thunderstorm winds 5 29000000
agricultural freeze 6 28820000
river flooding 18 28020000
unseasonably cold 23 25042500
small hail 52 20793000
landslide 599 20017000
hurricane opal 9 19000000
river flood 148 18909000
extreme windchill 204 17000000
severe thunderstorms 20 17000000
tropical storm jerry 3 16000000
ice storm 1928 15660000
winter weather 6992 15000000
hard freeze 7 13100000
winter storm 11372 11944000
lightning 14280 10766040
flash flooding 295 10075000

The processing we have already done will help clean this up, but there are a few event types that we didn’t consider in our analysis of the fatality events. Let’s add classes for “drought”, “frost/freeze”, “hail”, and “heavy rain”.

storm_data$event_class[grepl("drought", storm_data$event_type)] <- "drought"
storm_data$event_class[grepl("freeze|frost", storm_data$event_type)] <- "frost/freeze"
storm_data$event_class[grepl("hail", storm_data$event_type)] <- "hail"
storm_data$event_class[grepl("heavy rain|excessive wetness", storm_data$event_type)] <- "heavy rain"

Let’s see how the classification cleans up this table.

crop_by_type <- storm_data %>%
  group_by(event_class) %>%
  summarize(n = n(), total_crop_damage= sum(crop_damage_dollars)) %>%
  arrange(desc(total_crop_damage)) %>%
  filter(total_crop_damage > 5000000) %>%
  ungroup()

kable(head(as.data.frame(crop_by_type), 10))
event_class n total_crop_damage
drought 2493 13922121780
flood 82604 7040614500
hurricane 281 5504792800
hail 217619 2699785300
frost/freeze 1502 1886061000
extreme cold 2364 1409265500
thunderstorm winds 232977 1075788250
extreme heat 2587 899313500
heavy rain 11681 871909800
strong wind 24795 747914800
cat('\n')

We could keep going but this is enough classification to proceed with the analysis and get some results.

Results

population_impact <- storm_data %>%
  group_by(event_class)%>%
  summarize(total_deaths = sum(fatalities),
            total_injuries = sum(injuries)) 

injuries <- population_impact %>%
  arrange(total_injuries) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_injuries))

fatalities <- population_impact %>%
  arrange(total_deaths) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_deaths))

p1 <- fatalities[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_deaths)) +
  geom_col(fill = 'salmon') +
  coord_flip() +
  labs(x = "", y = "Total Fatalities")

p2 <- injuries[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_injuries)) +
  geom_col(fill = 'salmon') +
  coord_flip() +
    labs(x = "", y = "Total Injuries")

grid.arrange(p1, p2, nrow = 1, top = "Weather related fatalities and injuries in the US, 1995-2011")

Extreme heat, tornados, and flooding have caused the most population damage of all severe weather types from 1995 - 2011 in the US. Extreme heat has led to more fatalities than tornadoes, although tornadoes have caused more injuries.

 damages <- storm_data %>%
   group_by(event_class) %>%
   summarize(total_property_damage = sum(property_damage_dollars),
             total_crop_damage = sum(crop_damage_dollars)) %>%
   ungroup()
 
 property <- damages %>%
  arrange(total_property_damage) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_property_damage))
 
  crops<- damages %>%
  arrange(total_crop_damage) %>%
  mutate(event_class = factor(event_class, event_class)) %>%
  arrange(desc(total_crop_damage))

p3 <- property[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_property_damage/10^9)) +
  geom_col(fill = 'darkseagreen') +
  coord_flip() +
  labs(x = "", y = "Total property damage\n (in billions of US dollars)")

p4 <- crops[1:10, ] %>%
  ggplot(aes(x = event_class, y = total_crop_damage/10^9)) +
  geom_col(fill = 'darkseagreen') +
  coord_flip() +
  labs(x = "", y = "Total crop damage\n (in billions of US dollars)")

grid.arrange(p3, p4, nrow = 1, top = "Weather related damages to property and crops in the US, 1995-2011")

Not surprisingly, flooding, hurricanes, and storm surges have caused more property damage than any other types of severe weather. Tornadoes, hail, and strong winds also have a significant economic impact when it comes to property.

Drought has caused the highest damages to crops. Flooding, hurricanes, hail, and freezing weather are also leading contributors to crop damages.