The objective of this study is to identify which severe weather events had the most impact on peoples overall health and on economy in the US between the years 1950 and 2011. To answer this questions we will rely on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains detailed data about major storm and weather events across multiple regions in the United States, as well as estimate measurements of the number of injuries and fatalities and the financial consequences from damages caused by each event. With the results of this study, we hope to bring insights to governmental managers all over the US so they can better prepare for future incidents.
From the Storm Data Download we can obtain the storm database just as specified above. For further and updated information about this database you can refer to the NOAA Database Website. You can also read the Storm Data Documentation and National Climatic Datacenter Storm Events FAQ.
library(tidyverse)
library(scales)
library(lubridate)
First we get the work directory ready and organized to receive the data. Then we download the .bz2 into the data folder.
# Check if data folder exists in the work directory
if(!file.exists("./data")){
dir.create("./data")
}
# Check if project file exists within the data folder. If not, download it.
if(!file.exists("./data/StormData.csv.bz2")){
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file_url, destfile = "./data/StormData.csv.bz2")
}
To read the data we will use the readr package (automatically loaded with the tidyverse package). It is important to note that the dataset has some pattern issues and changes in some of its data across all the years, so the guess_max argument in the following code chunk came in handy for that problem, specially in the early years. The reading process may take a bit longer though.
# The following guess_max arg from the readr::read_csv function was helpful to determine the overall data types for each variable in the dataset.
# The number 902297 is the number of obs in the dataset.
storm_data <- read_csv("./data/StormData.csv.bz2", guess_max = 902297)
The data frame dimensions and structure can be checked as follows.
dim(storm_data) # Check dimensions
## [1] 902297 37
str(storm_data, give.attr = FALSE) # Check structure
## tibble [902,297 x 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ STATE__ : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr [1:902297] "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr [1:902297] NA NA NA NA ...
## $ BGN_LOCATI: chr [1:902297] NA NA NA NA ...
## $ END_DATE : chr [1:902297] NA NA NA NA ...
## $ END_TIME : chr [1:902297] NA NA NA NA ...
## $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
## $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr [1:902297] NA NA NA NA ...
## $ END_LOCATI: chr [1:902297] NA NA NA NA ...
## $ LENGTH : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
## $ F : num [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:902297] NA NA NA NA ...
## $ WFO : chr [1:902297] NA NA NA NA ...
## $ STATEOFFIC: chr [1:902297] NA NA NA NA ...
## $ ZONENAMES : chr [1:902297] NA NA NA NA ...
## $ LATITUDE : num [1:902297] 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
## $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
## $ REMARKS : chr [1:902297] NA NA NA NA ...
## $ REFNUM : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
unique_events <- length(unique(storm_data$EVTYPE)) # Calculate unique event types
We can also see that, in its current state, the storm_data data frame contains 977 unique event types, which is a large number of event types if compared to the 48 official event types mentioned in the NOAA Database Website and Storm Data Documentation. This is probably due to lack of event names standardization along the years.
We have a total of 37 variables with 902,297 observations. To properly answer our questions we will only need 9 of them, which are:
BGN_DATE → Begin Date
END_DATE → End Date
EVTYPE → Event Type
FATALITIES → Estimate number of fatalities caused by the event
INJURIES → Estimate number of injuries caused by the event
PROPDMG → Value of property damage caused by the event
PROPDMGEXP → Exponent of the property damage value (e.g., K fo thousand, M for million, B for billion, H for hundred, integer i for 10^i)1
CROPDMG → Value of crop damage caused by the event
CROPDMGEXP → Exponent of the crop damage value (e.g., K fo thousand, M for million, B for billion, H for hundred, integer i for 10^i)2
In order to subset and improve the originally created storm_data data frame we will use the dplyr package (also automatically loaded with the tidyverse package). For that we will: keep only the variables mentioned above, create two new damage value variables based on their respective exponent, create a new variable with the sum of injuries and fatalities together for each observation, create a new variable with the sum of damage value of property and crop categories together, properly format the dates variables as date class and set the event types as upper case character class. We will also filter out observations in which both health and economic impact equals 0. Finally, to avoid possible distortion on the events impact proportionality, we will only keep the records of events happening from 1996 to 2011. As stated in the NOAA Database Website, the properly recording of the complete list of event types only started in 1996 as defined in NWS Directive 10-1605.
As a last complementary step, we will also try to reduce some of the standardization problem by changing some event type names that should belong in the same group. Those changes were chosen by analyzing the preprocessed data and identifying some health and economic relevant events that should be grouped together.
storm_data_clean <- storm_data %>%
# Subset needed variables
select(BGN_DATE, END_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
# This step will take the property damage value and multiply it by is exponent based on the rules previously mentioned
mutate(
prop_dmg_val = PROPDMG * case_when(
tolower(PROPDMGEXP) == "k" ~ 10^3,
tolower(PROPDMGEXP) == "m" ~ 10^6,
tolower(PROPDMGEXP) == "b" ~ 10^9,
tolower(PROPDMGEXP) == "h" ~ 10^2,
!is.na(as.numeric(PROPDMGEXP)) ~ 10^as.numeric(PROPDMGEXP),
TRUE ~ 10^0
),
# This step will take the crop damage value and multiply it by is exponent based on the rules previously mentioned
crop_dmg_val = CROPDMG * case_when(
tolower(CROPDMGEXP) == "k" ~ 10^3,
tolower(CROPDMGEXP) == "m" ~ 10^6,
tolower(CROPDMGEXP) == "b" ~ 10^9,
tolower(CROPDMGEXP) == "h" ~ 10^2,
!is.na(as.numeric(CROPDMGEXP)) ~ 10^as.numeric(CROPDMGEXP),
TRUE ~ 10^0
)
) %>%
# Sum of health related variables into the health_harm var and economic related variables into the economic_conseq var
mutate(health_harm = FATALITIES + INJURIES, economic_conseq = prop_dmg_val + crop_dmg_val) %>%
# Overall formatting
mutate(BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
END_DATE = as.Date(END_DATE, format = "%m/%d/%Y %H:%M:%S"),
EVTYPE = toupper(EVTYPE)
) %>%
# Filter irrelevant observations and keep only those happening between 1996 and 2011
filter(health_harm + economic_conseq > 0, year(BGN_DATE) >= 1996) %>%
# Standardize names of relevant event types that belong to the same event
mutate(EVTYPE = case_when(
EVTYPE == "HURRICANE/TYPHOON" ~ "HURRICANE",
EVTYPE == "WILD/FOREST FIRE" ~ "WILDFIRE",
EVTYPE == "RIP CURRENTS" ~ "RIP CURRENT",
EVTYPE == "STRONG WINDS" ~ "STRONG WIND",
substr(EVTYPE,1, 13) == "COASTAL FLOOD" ~ "COASTAL FLOOD",
TRUE ~ EVTYPE
)
)
# Calculates the number of unique events in the new data frame
unique_events <- length(unique(storm_data_clean$EVTYPE))
With the cleaning, subset and formatting of the original data frame, we will end up with 177 types of events, which is much better than the original number, but yet not equal to the number reported in the official resources. Despite this fact, we will advance in the analysis as it is in its current state, since it would be pretty work intensive to manually summarize that variable into 48 categories, specially considering that those changes would probably be irrelevant for the results we seek.
To answer the first question, we will analyze the health harm separately from the economic consequences. For that we will subset and summarize the previously created cleaned data to make it suitable to our needs. We will also visualize only the 20 most meaningful events and will be able to check how relevant they are to the fatalities/injuries numbers.
storm_data_health <- storm_data_clean %>%
group_by(EVTYPE) %>%
summarise(total_health = sum(health_harm)) %>%
arrange(desc(total_health)) %>%
mutate(EVTYPE = fct_reorder(EVTYPE, desc(total_health))) %>% # Reorder levels of event types by relevance
mutate(cum_percent = cumsum( total_health / sum(total_health)) ) %>% # Calculate the cumulative sum over the total
slice_max(total_health, n = 20) # Keep only the 20 most relevant event types
The health harm data frame can be seen below.
storm_data_health
## # A tibble: 20 x 3
## EVTYPE total_health cum_percent
## <fct> <dbl> <dbl>
## 1 TORNADO 22178 0.332
## 2 EXCESSIVE HEAT 8188 0.455
## 3 FLOOD 7172 0.563
## 4 LIGHTNING 4792 0.635
## 5 TSTM WIND 3870 0.693
## 6 FLASH FLOOD 2561 0.731
## 7 WILDFIRE 1543 0.754
## 8 THUNDERSTORM WIND 1530 0.777
## 9 WINTER STORM 1483 0.799
## 10 HEAT 1459 0.821
## 11 HURRICANE 1446 0.843
## 12 HIGH WIND 1318 0.863
## 13 RIP CURRENT 1045 0.878
## 14 HEAVY SNOW 805 0.890
## 15 FOG 772 0.902
## 16 HAIL 720 0.913
## 17 BLIZZARD 455 0.919
## 18 STRONG WIND 409 0.926
## 19 ICE STORM 400 0.932
## 20 TROPICAL STORM 395 0.938
With the new data frame we can now plot the data to find answers to our first question.
max_health_harm = max(storm_data_health$total_health) # This code is used to create a second proportional y label in the following plot
ggplot(storm_data_health, aes(x = EVTYPE)) +
theme_light() +
geom_col(mapping = aes(y = total_health), fill = "steelblue", color = "black", alpha = 0.5) +
geom_line(aes(y = cum_percent * max_health_harm, group = NA), size = 0.5, color = "red") +
labs(x = "Event Type", y = "Total Health Impact (Fatalities + Injuries)") +
# Format label scales
scale_y_continuous(
labels = scales::comma,
sec.axis = sec_axis( trans=~./max_health_harm, name="Percent", labels = scales::percent_format(accuracy = 1))
) +
theme(
axis.text.x = element_text(angle = 90),
axis.text.y.right = element_text(color = "red"),
axis.title.y.right = element_text(color = "red")
)
As we can see, regarding population health (this includes the sum of fatalities and injuries), from 1996 to 2011 tornadoes were, by far, the most harmful event, being the cause of 33.2% of total deaths plus injuries in that period, followed by excessive heat with 12.3% and flood with 10.8%. The top 20 storm events were responsible for 93.8% of total fatalities plus injuries.
The second question in our analysis will be answered in a similar way as the previous one, considering, however, only the observations with economical consequences.
storm_data_econ <- storm_data_clean %>%
group_by(EVTYPE) %>%
summarise(total_econ = sum(economic_conseq)) %>%
arrange(desc(total_econ)) %>%
mutate(EVTYPE = fct_reorder(EVTYPE, desc(total_econ)), total_econ = total_econ / 10^9) %>% # Reorder levels of event types by relevance and change total_econ scale
mutate(cum_percent = cumsum( total_econ / sum(total_econ)) ) %>% # Calculate the cumulative sum over the total
slice_max(total_econ, n = 20) # Keep only the 20 most relevant event types
The economical impact data frame can be seen below (total_econ var in Billions).
storm_data_econ
## # A tibble: 20 x 3
## EVTYPE total_econ cum_percent
## <fct> <dbl> <dbl>
## 1 FLOOD 149. 0.371
## 2 HURRICANE 86.5 0.586
## 3 STORM SURGE 43.2 0.694
## 4 TORNADO 24.9 0.756
## 5 HAIL 17.1 0.798
## 6 FLASH FLOOD 16.6 0.840
## 7 DROUGHT 14.4 0.875
## 8 TROPICAL STORM 8.32 0.896
## 9 WILDFIRE 8.16 0.917
## 10 HIGH WIND 5.88 0.931
## 11 TSTM WIND 5.04 0.944
## 12 STORM SURGE/TIDE 4.64 0.955
## 13 THUNDERSTORM WIND 3.78 0.965
## 14 ICE STORM 3.66 0.974
## 15 WINTER STORM 1.54 0.978
## 16 EXTREME COLD 1.33 0.981
## 17 HEAVY RAIN 1.31 0.984
## 18 FROST/FREEZE 1.10 0.987
## 19 LIGHTNING 0.750 0.989
## 20 HEAVY SNOW 0.706 0.991
Now, to the plot that will answer our second question.
max_econ_conseq = max(storm_data_econ$total_econ) # This code is used to create a second proportional y label in the following plot
ggplot(storm_data_econ, aes(x = EVTYPE)) +
theme_light() +
geom_col(mapping = aes(y = total_econ), fill = "seagreen", color = "black", alpha = 0.5) +
geom_line(aes(y = cum_percent * max_econ_conseq, group = NA), size = 0.5, color = "red") +
labs(x = "Event Type", y = "Total Economic Impact (Billions)") +
# Format label scales
scale_y_continuous(
labels = unit_format(unit = "B"),
sec.axis = sec_axis( trans=~./max_econ_conseq, name="Percent", labels = scales::percent_format(accuracy = 1))
) +
theme(
axis.text.x = element_text(angle = 90),
axis.text.y.right = element_text(color = "red"),
axis.title.y.right = element_text(color = "red")
)
Regarding the economical impact, we can clearly see that floods was the main event type to damage properties and crops between 1996 and 2011, being responsible for 37.1% of total damage value, followed by hurricanes with 21.5% and storm surges with 10.8%. In this case, the top 20 storm events were responsible for 99.1% of total damage value.