Severe weather events have a substantial impact on communities across the United States. Each year, severe weather causes billions of dollars in damages to property and crops, and hundreds of lives are lost, in addition to thousands of injuries. The purpose of this analysis is to identify the types of weather events that cause the greatest damage to the population and to the economy.
The data used in this analysis is from the US National Oceanic and Atmospheric Administration (NOAA) database. The data can be downloaded here.
For more information on this data set, refer to the National Weather Service Storm Data Documentation.
First let’s load the libraries we will need.
library(R.utils)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(knitr)
library(lubridate)
Next, we download the data.
If the data are not already in the working directory, we will need to download the data and unzip the file.
if(!file.exists("stormdata.csv")) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, "stormdata.csv.bz2")
bunzip2("stormdata.csv.bz2") }
Now we can read the data into R with read_csv.
storms <- read.csv("stormdata.csv")
We will start by taking a look at the storms data frame and get a feel for what it contains. Let’s look at the dimensions.
dim(storms)
## [1] 902297 37
We see that storms contains 902297 observations of 37 variables.
Next, let’s look at the column names.
colnames(storms)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The data frame contains information on many variables that will not be relevant for this analysis. Since we will be focusing on the impact of different types of natural disasters on population health and on the economy, let’s select only the variables that are relevant to that. I will keep the ‘BGN_DATE’ variable as well so that we can determine if there are any years that should be excluded. Let’s also make the variable names a little more readable.
storm_data <- storms %>%
select(date = BGN_DATE, event_type = EVTYPE, fatalities = FATALITIES, injuries = INJURIES,
property_damage = PROPDMG, property_damage_exponent = PROPDMGEXP,
crop_damage = CROPDMG, crop_damage_exponent = CROPDMGEXP)
Let’s take a look at the structure of our new data frame.
str(storm_data)
## 'data.frame': 902297 obs. of 8 variables:
## $ date : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ event_type : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ fatalities : num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ property_damage : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ property_damage_exponent: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ crop_damage : num 0 0 0 0 0 0 0 0 0 0 ...
## $ crop_damage_exponent : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
There are 985 unique values in the event type column. That’s a lot! I bet some of these are duplicates and should be grouped together. We will come back to that later.
The date information is stored as a factor. Let’s parse it and extract the year.
storm_data$date <- mdy_hms(storm_data$date)
storm_data <- storm_data %>%
mutate(year = year(date)) %>%
select(-date)
Now that we have added a year column, we can look at the distribution of the number of records created over time. The assignment has stated that the data collected in the earlier years is less complete. Making a histogram can show us which years we want to include in our analysis.
storm_data %>%
ggplot(aes(x = year, fill = (year >= 1995))) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = seq(1950, 2010, 5)) +
theme(axis.text.x = element_text(angle = 90)) +
theme(legend.position = "none") +
labs(title = "Histogram of records per year")
Sure enough, there is a big increase around 1995 in the number of events recorded. Let’s get rid of the earlier data so it doesn’t skew our analysis.
storm_data <- storm_data %>%
filter(year >= 1995)
Let’s call str again and see where we are now.
str(storm_data)
## 'data.frame': 681500 obs. of 8 variables:
## $ event_type : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 201 629 657 657 409 244 786 786 244 786 ...
## $ fatalities : num 0 0 0 0 2 0 0 0 0 0 ...
## $ injuries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ property_damage : num 0 0 0 0 0.1 0 0 0 0 0 ...
## $ property_damage_exponent: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 14 1 1 1 1 1 ...
## $ crop_damage : num 0 0 0 0 10 0 0 0 0 0 ...
## $ crop_damage_exponent : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 9 1 1 1 1 1 ...
## $ year : num 1995 1995 1995 1995 1995 ...
fatalities and injuries are numerics representing how many people perished or sustained an injury during each event. That part is pretty straightforward. No processing needed here.
The columns representing damages to property and crops will need some processing. According to the following excerpt from the documentation provided by NOAA, we would excpect the property_damage_exponent and crop_damage_exponent columns to contain “K”, “M”, and “B” as factor levels.
Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
However, the call to str shows that these columns also contain numbers and characters such as “?” and “+”. We can use table to find out how many of the observations containing these characters actually have non-zero property damages . While we’re at it, let’s fix the inconsistencies in case using toupper.
storm_data$crop_damage_exponent <- toupper(storm_data$crop_damage_exponent)
storm_data$property_damage_exponent <- toupper(storm_data$property_damage_exponent)
table(storm_data$crop_damage_exponent, storm_data$crop_damage > 0)
##
## FALSE TRUE
## 400088 0
## ? 5 0
## 0 4 4
## 2 1 0
## B 2 5
## K 261693 17793
## M 76 1829
table(storm_data$property_damage_exponent, storm_data$property_damage > 0)
##
## FALSE TRUE
## 294449 67
## - 0 1
## ? 5 0
## + 0 4
## 0 4 183
## 1 25 0
## 2 12 1
## 3 3 1
## 4 0 3
## 5 9 17
## 6 0 3
## 7 3 2
## 8 1 0
## B 0 37
## H 0 6
## K 188071 190635
## M 11 7947
Ok. It looks like the vast majority of records with non-zero damages are marked with “K”, “M”, or “B”. (There are also 6 records with an “H” in the property damages table - we will assume that stands for “hundreds” and not exclude it.) We will consider the damage value for any records that do not adhere to this system to be 0.
Let’s calculate new columns property_damage_dollars and crop_damage_dollars that multiply the values in the numeric damages columns by the appropriate exponent.
storm_data <- storm_data %>%
mutate(property_multiplier = case_when(
property_damage_exponent == "H" ~ 10^2,
property_damage_exponent == "K" ~ 10^3,
property_damage_exponent == "M" ~ 10^6,
property_damage_exponent == "B" ~ 10^9,
!property_damage_exponent %in% c("H", "K", "M", "B") ~ 0 ))
storm_data <- storm_data %>%
mutate(crop_multiplier = case_when(
crop_damage_exponent == "K" ~ 10^3,
crop_damage_exponent == "M" ~ 10^6,
crop_damage_exponent == "B" ~ 10^9,
!crop_damage_exponent %in% c("K", "M", "B") ~ 0 ))
storm_data$property_damage_dollars <- rep(0, nrow(storm_data))
storm_data$crop_damage_dollars <- rep(0, nrow(storm_data))
storm_data <- storm_data %>%
mutate(property_damage_dollars = property_damage * property_multiplier,
crop_damage_dollars = crop_damage * crop_multiplier)
Now that we have converted the damage values to dollar amounts, let’s go back and look at how the different event types are recorded.
head(unique(storm_data$event_type), 50)
## [1] FREEZING RAIN SNOW
## [3] SNOW/ICE HURRICANE OPAL/HIGH WINDS
## [5] HAIL THUNDERSTORM WINDS
## [7] RECORD COLD HURRICANE ERIN
## [9] HURRICANE OPAL DENSE FOG
## [11] RIP CURRENT TORNADO
## [13] THUNDERSTORM WINS LIGHTNING
## [15] FLASH FLOOD FLASH FLOODING
## [17] HIGH WINDS TORNADO F0
## [19] THUNDERSTORM WINDS LIGHTNING FUNNEL CLOUD
## [21] THUNDERSTORM WINDS/HAIL THUNDERSTORM WIND
## [23] HEAT WIND
## [25] HEAVY RAINS LIGHTNING AND HEAVY RAIN
## [27] HEAVY RAIN THUNDERSTORM WINDS HAIL
## [29] FLOOD COLD
## [31] WALL CLOUD/FUNNEL CLOUD THUNDERSTORM
## [33] WATERSPOUT EXTREME COLD
## [35] BLIZZARD WEATHER BLIZZARD
## [37] WIND CHILL BREAKUP FLOODING
## [39] RECORD HIGH TEMPERATURE RECORD HIGH
## [41] HIGH WIND COASTAL FLOOD
## [43] RIVER FLOOD HIGH WINDS HEAVY RAINS
## [45] HIGH WIND/LOW WIND CHILL HEAVY SNOW
## [47] HEAVY SNOW/HIGH RECORD LOW
## [49] HIGH WINDS AND WIND CHILL HEAVY SNOW/HIGH WINDS/FREEZING
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
It looks like there is a lot of inconsistency in how the event type is recorded in the data. Just from this brief glance at the different event types, we can see that some entries contain misspellings or abbreviations, some have extra details, and some have characters such as “/”. We will need to do some more work to group the different types into classes before moving forward with the analysis.
Because we are only concerned with event types that cause the most damage, we don’t need to go through each event type individually and assign it to a class. We can limit ourselves to the event types that matter most. To figure out what those are, we can make a table showing the event types for events with > 50 fatalities. I will also convert all event types to lower case to facilitate this process.
storm_data$event_type <- tolower(storm_data$event_type)
event_types <- storm_data %>%
group_by(event_type) %>%
summarize(n = n(), deaths = sum(fatalities)) %>%
arrange(desc(deaths)) %>%
filter(deaths > 50) %>%
ungroup()
kable(as.data.frame(event_types))
| event_type | n | deaths |
|---|---|---|
| excessive heat | 1675 | 1903 |
| tornado | 24335 | 1545 |
| flash flood | 52673 | 934 |
| heat | 755 | 924 |
| lightning | 14280 | 729 |
| flood | 24642 | 423 |
| rip current | 459 | 360 |
| high wind | 19958 | 241 |
| tstm wind | 128925 | 241 |
| avalanche | 380 | 223 |
| rip currents | 304 | 204 |
| winter storm | 11372 | 195 |
| heat wave | 69 | 161 |
| thunderstorm wind | 81746 | 131 |
| extreme cold | 634 | 128 |
| extreme cold/wind chill | 1002 | 125 |
| heavy snow | 14710 | 115 |
| strong wind | 3565 | 103 |
| high surf | 730 | 102 |
| cold/wind chill | 539 | 95 |
| heavy rain | 11640 | 95 |
| extreme heat | 17 | 91 |
| ice storm | 1928 | 84 |
| wildfire | 2757 | 75 |
| blizzard | 2656 | 71 |
| hurricane/typhoon | 88 | 64 |
| fog | 535 | 61 |
| hurricane | 170 | 61 |
| tropical storm | 684 | 57 |
“Excessive heat”, “heat wave”, “heat”, and “extreme heat” appear on this list separately. We can create a class “extreme heat” that groups them together. Let’s look at all the event types that pertain to heat.
storm_data$event_type <- tolower(storm_data$event_type)
unique(grep("heat", storm_data$event_type, value = TRUE))
## [1] "heat" "extreme heat"
## [3] "excessive heat" "record heat"
## [5] "heat wave" "drought/excessive heat"
## [7] "heat wave drought" "heatburst"
## [9] "excessive heat/drought"
Three of these contain “drought”. “Drought” is an important classification when it comes to crop damage, so we don’t want to lose that information by assigning them to “heat”. These will be reclassified later. For now, assign anything that has “heat” in the event type name to the “extreme heat” class.
storm_data$event_class <- storm_data$event_type
storm_data$event_class[grepl("heat", storm_data$event_type)] <- "extreme heat"
We can do the same thing for cold that we did for heat.
unique(grep("cold", storm_data$event_type, value = TRUE))
## [1] "record cold" "cold"
## [3] "extreme cold" "unseasonably cold"
## [5] "extreme/record cold" "cold wave"
## [7] "cold and wet conditions" "cold air funnels"
## [9] "cold air funnel" "prolong cold"
## [11] "record cold/frost" "record snow/cold"
## [13] "cold weather" "prolong cold/snow"
## [15] "unseasonable cold" "excessive cold"
## [17] "extended cold" "cold temperature"
## [19] "cold and snow" "cold and frost"
## [21] "cold temperatures" "cold wind chill temperatures"
## [23] "record cold" "unusually cold"
## [25] "extreme cold/wind chill" "cold/wind chill"
All of these event types can be grouped into an “extreme cold” category.
storm_data$event_class[grepl("cold", storm_data$event_type)] <- "extreme cold"
The “tornado” event type has the second most fatalities according to our table. “Tornado” is coded in several different ways, as was heat, so let’s create an event class for “tornado”.
storm_data$event_class[grepl("torn", storm_data$event_type)] <- "tornado"
I will do the same for a few of the other event types. I’ll spare you the gory details.
storm_data$event_class[grepl("flood|fld", storm_data$event_type)] <- "flood"
storm_data$event_class[grepl("fire", storm_data$event_type)] <- "fire"
storm_data$event_class[grepl("aval", storm_data$event_type)] <- "avalanche"
storm_data$event_class[grepl("rip", storm_data$event_type)] <- "rip current"
storm_data$event_class[grepl("thunder|tstm", storm_data$event_type)] <- "thunderstorm winds"
storm_data$event_class[grepl("hurricane", storm_data$event_type)] <- "hurricane"
storm_data$event_class[grepl("tropical", storm_data$event_type)] <- "tropical storm"
storm_data$event_class[grepl("wint|blizz", storm_data$event_type)] <- "winter storm"
storm_data$event_class[grepl("strong wind|high wind|gusty wind", storm_data$event_type)] <- "strong wind"
storm_data$event_class[grepl("ice", storm_data$event_type)] <- "ice"
storm_data$event_class[grepl("surge", storm_data$event_type)] <- "storm surge"
Let’s check our progress by making a table of the event classes including how many records are in each, and how many fatalities are attributed to each.
event_classes <- storm_data %>%
group_by(event_class) %>%
summarize(n = n(), deaths = sum(fatalities)) %>%
arrange(desc(deaths)) %>%
filter(deaths > 50) %>%
ungroup()
kable(as.data.frame(event_classes))
| event_class | n | deaths |
|---|---|---|
| extreme heat | 2602 | 3087 |
| tornado | 24373 | 1545 |
| flood | 82613 | 1414 |
| lightning | 14280 | 729 |
| rip current | 766 | 569 |
| thunderstorm winds | 234076 | 440 |
| extreme cold | 2373 | 403 |
| strong wind | 24797 | 386 |
| winter storm | 22237 | 328 |
| avalanche | 380 | 223 |
| hurricane | 281 | 133 |
| heavy snow | 14710 | 115 |
| high surf | 730 | 102 |
| heavy rain | 11640 | 95 |
| ice | 2033 | 93 |
| fire | 4216 | 87 |
| fog | 535 | 61 |
| tropical storm | 749 | 57 |
We could keep classifying, but what we’ve done so far is enough to capture the top 10 event types in terms of impact on human population.
Since the types of events that affect crops are different than those that affect population and property, let’s investigate whether we need to do any more classification to capture crop damage.
crop_by_type <- storm_data %>%
group_by(event_type) %>%
summarize(n = n(), total_crop_damage= sum(crop_damage_dollars)) %>%
arrange(desc(total_crop_damage)) %>%
filter(total_crop_damage > 10000000) %>%
ungroup()
kable(as.data.frame(crop_by_type))
| event_type | n | total_crop_damage |
|---|---|---|
| drought | 2471 | 13922066000 |
| flood | 24642 | 5422810400 |
| hurricane | 170 | 2741410000 |
| hail | 215932 | 2614127050 |
| hurricane/typhoon | 88 | 2607872800 |
| flash flood | 52673 | 1343915000 |
| extreme cold | 634 | 1312473000 |
| frost/freeze | 1343 | 1094186000 |
| heavy rain | 11640 | 728399800 |
| tropical storm | 684 | 677836000 |
| high wind | 19958 | 633561300 |
| tstm wind | 128925 | 553947350 |
| excessive heat | 1675 | 492402000 |
| thunderstorm wind | 81746 | 414354000 |
| freeze | 74 | 406725000 |
| heat | 755 | 401411500 |
| tornado | 24335 | 296595610 |
| damaging freeze | 8 | 296230000 |
| wildfire | 2757 | 295472800 |
| excessive wetness | 1 | 142000000 |
| hurricane erin | 7 | 136010000 |
| heavy snow | 14710 | 129153100 |
| flood/rain/winds | 6 | 112800000 |
| wild/forest fire | 1446 | 106782830 |
| flood/flash flood | 258 | 94979000 |
| cold and wet conditions | 1 | 66000000 |
| strong wind | 3565 | 64953500 |
| tstm wind/hail | 1028 | 64696250 |
| thunderstorm winds | 10041 | 60172900 |
| early frost | 1 | 42000000 |
| high winds | 776 | 34170000 |
| severe thunderstorm winds | 5 | 29000000 |
| agricultural freeze | 6 | 28820000 |
| river flooding | 18 | 28020000 |
| unseasonably cold | 23 | 25042500 |
| small hail | 52 | 20793000 |
| landslide | 599 | 20017000 |
| hurricane opal | 9 | 19000000 |
| river flood | 148 | 18909000 |
| extreme windchill | 204 | 17000000 |
| severe thunderstorms | 20 | 17000000 |
| tropical storm jerry | 3 | 16000000 |
| ice storm | 1928 | 15660000 |
| winter weather | 6992 | 15000000 |
| hard freeze | 7 | 13100000 |
| winter storm | 11372 | 11944000 |
| lightning | 14280 | 10766040 |
| flash flooding | 295 | 10075000 |
The processing we have already done will help clean this up, but there are a few event types that we didn’t consider in our analysis of the fatality events. Let’s add classes for “drought”, “frost/freeze”, “hail”, and “heavy rain”.
storm_data$event_class[grepl("drought", storm_data$event_type)] <- "drought"
storm_data$event_class[grepl("freeze|frost", storm_data$event_type)] <- "frost/freeze"
storm_data$event_class[grepl("hail", storm_data$event_type)] <- "hail"
storm_data$event_class[grepl("heavy rain|excessive wetness", storm_data$event_type)] <- "heavy rain"
Let’s see how the classification cleans up this table.
crop_by_type <- storm_data %>%
group_by(event_class) %>%
summarize(n = n(), total_crop_damage= sum(crop_damage_dollars)) %>%
arrange(desc(total_crop_damage)) %>%
filter(total_crop_damage > 5000000) %>%
ungroup()
kable(head(as.data.frame(crop_by_type), 10))
| event_class | n | total_crop_damage |
|---|---|---|
| drought | 2493 | 13922121780 |
| flood | 82604 | 7040614500 |
| hurricane | 281 | 5504792800 |
| hail | 217619 | 2699785300 |
| frost/freeze | 1502 | 1886061000 |
| extreme cold | 2364 | 1409265500 |
| thunderstorm winds | 232977 | 1075788250 |
| extreme heat | 2587 | 899313500 |
| heavy rain | 11681 | 871909800 |
| strong wind | 24795 | 747914800 |
cat('\n')
We could keep going but this is enough classification to proceed with the analysis and get some results.
population_impact <- storm_data %>%
group_by(event_class)%>%
summarize(total_deaths = sum(fatalities),
total_injuries = sum(injuries))
injuries <- population_impact %>%
arrange(total_injuries) %>%
mutate(event_class = factor(event_class, event_class)) %>%
arrange(desc(total_injuries))
fatalities <- population_impact %>%
arrange(total_deaths) %>%
mutate(event_class = factor(event_class, event_class)) %>%
arrange(desc(total_deaths))
p1 <- fatalities[1:10, ] %>%
ggplot(aes(x = event_class, y = total_deaths)) +
geom_col(fill = 'salmon') +
coord_flip() +
labs(x = "", y = "Total Fatalities")
p2 <- injuries[1:10, ] %>%
ggplot(aes(x = event_class, y = total_injuries)) +
geom_col(fill = 'salmon') +
coord_flip() +
labs(x = "", y = "Total Injuries")
grid.arrange(p1, p2, nrow = 1, top = "Weather related fatalities and injuries in the US, 1995-2011")
Extreme heat, tornados, and flooding have caused the most population damage of all severe weather types from 1995 - 2011 in the US. Extreme heat has led to more fatalities than tornadoes, although tornadoes have caused more injuries.
damages <- storm_data %>%
group_by(event_class) %>%
summarize(total_property_damage = sum(property_damage_dollars),
total_crop_damage = sum(crop_damage_dollars)) %>%
ungroup()
property <- damages %>%
arrange(total_property_damage) %>%
mutate(event_class = factor(event_class, event_class)) %>%
arrange(desc(total_property_damage))
crops<- damages %>%
arrange(total_crop_damage) %>%
mutate(event_class = factor(event_class, event_class)) %>%
arrange(desc(total_crop_damage))
p3 <- property[1:10, ] %>%
ggplot(aes(x = event_class, y = total_property_damage/10^9)) +
geom_col(fill = 'darkseagreen') +
coord_flip() +
labs(x = "", y = "Total property damage\n (in billions of US dollars)")
p4 <- crops[1:10, ] %>%
ggplot(aes(x = event_class, y = total_crop_damage/10^9)) +
geom_col(fill = 'darkseagreen') +
coord_flip() +
labs(x = "", y = "Total crop damage\n (in billions of US dollars)")
grid.arrange(p3, p4, nrow = 1, top = "Weather related damages to property and crops in the US, 1995-2011")
Not surprisingly, flooding, hurricanes, and storm surges have caused more property damage than any other types of severe weather. Tornadoes, hail, and strong winds also have a significant economic impact when it comes to property.
Drought has caused the highest damages to crops. Flooding, hurricanes, hail, and freezing weather are also leading contributors to crop damages.