Natural disasters, such as storms and severe weather events, are a source calamity from a economic and social perspective. Analyzing the United States (U.S.) National Oceanic and Atmospheric Administration’s (NOAA) storm database, first, I inspect the evolution of the total number of servere events in U.S. over the 1950-2011 period, as well as analyze how the occurrency of the events differ across the months. After that, I analyze the most recurrent events in U.S. during the available period and their impact over the population health. Additionally, I show the most harmful events regarding economic impact in the U.S..
For the analysis, I will be using data from NOAA storm database, which can be downloaded here.
Once the dataset is downloaded, we can unzip the file. As I am using a Linux machine, I set the argument unzip = /usr/bin/bunzip2 to specify the method of unziping to use in the following command:
unzip(zipfile = "repdata%2Fdata%2FStormData.csv.bz2",
unzip = "/usr/bin/bunzip2")
Now the the dataset is finally downloaded and unziped, we are ready to read it:
library(data.table)
library(dplyr)
foo <- fread("storm_data")
foo %>% head
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4: 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5: 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6: 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1: TORNADO 0 0
## 2: TORNADO 0 0
## 3: TORNADO 0 0
## 4: TORNADO 0 0
## 5: TORNADO 0 0
## 6: TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1: NA 0 14.0 100 3 0 0
## 2: NA 0 2.0 150 2 0 0
## 3: NA 0 0.1 123 2 0 0
## 4: NA 0 0.0 100 2 0 0
## 5: NA 0 0.0 150 2 0 0
## 6: NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1: 15 25.0 K 0
## 2: 0 2.5 K 0
## 3: 2 25.0 K 0
## 4: 2 2.5 K 0
## 5: 2 2.5 K 0
## 6: 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1: 3040 8812 3051 8806 1
## 2: 3042 8755 0 0 2
## 3: 3340 8742 0 0 3
## 4: 3458 8626 0 0 4
## 5: 3412 8642 0 0 5
## 6: 3450 8748 0 0 6
Now let’s perform the data cleaning process. First, I transform the Date variable into using the lubridate package and then create the variables year and month:
library(lubridate)
foo <- foo %>%
mutate(BGN_DATE = mdy_hms(BGN_DATE)) %>%
as.data.table()
## CREATE YEAR AND MONTH VARIABLES::
foo <- foo %>%
mutate(year = lubridate::year(BGN_DATE),
month = lubridate::month(BGN_DATE, label = T)) %>%
as.data.table()
foo %>% head
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1: 1 1950-04-18 0130 CST 97 MOBILE AL TORNADO
## 2: 1 1950-04-18 0145 CST 3 BALDWIN AL TORNADO
## 3: 1 1951-02-20 1600 CST 57 FAYETTE AL TORNADO
## 4: 1 1951-06-08 0900 CST 89 MADISON AL TORNADO
## 5: 1 1951-11-15 1500 CST 43 CULLMAN AL TORNADO
## 6: 1 1951-11-15 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1: 0 0 NA
## 2: 0 0 NA
## 3: 0 0 NA
## 4: 0 0 NA
## 5: 0 0 NA
## 6: 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## 1: 0 14.0 100 3 0 0 15
## 2: 0 2.0 150 2 0 0 0
## 3: 0 0.1 123 2 0 0 2
## 4: 0 0.0 100 2 0 0 2
## 5: 0 0.0 150 2 0 0 2
## 6: 0 1.5 177 2 0 0 6
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1: 25.0 K 0 3040
## 2: 2.5 K 0 3042
## 3: 25.0 K 0 3340
## 4: 2.5 K 0 3458
## 5: 2.5 K 0 3412
## 6: 2.5 K 0 3450
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM year month
## 1: 8812 3051 8806 1 1950 Apr
## 2: 8755 0 0 2 1950 Apr
## 3: 8742 0 0 3 1951 Feb
## 4: 8626 0 0 4 1951 Jun
## 5: 8642 0 0 5 1951 Nov
## 6: 8748 0 0 6 1951 Nov
Now, I transform the characters in EVTYPE variable to lower case and remove any special character and extra spaces:
library(stringr)
## TO LOWECASE:
foo <- foo %>%
mutate(EVTYPE = str_to_lower(EVTYPE)) %>%
as.data.table()
## REMOVE SPECIAL CHARACTERS:
foo <- foo %>%
mutate(EVTYPE = str_replace_all(string = EVTYPE, pattern = "[[:punct:]]", replacement = " ")) %>%
as.data.table()
## TRIM EXTRA BLANK SPACES:
foo <- foo %>%
mutate(EVTYPE = trimws(x = EVTYPE, which = "both")) %>%
as.data.table()
The EVTYPE variable contains information for the kind of event happened. According to the Storm Data Documentation, there are 48 event type labels. Inspecting this variable closer, I notice some typos. In order to clean this variable, I try to group each classification in one of the 48 pre-classified groups.
event_types <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold Wind Chill",
"Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
"Dust Storm", "Excessive Heat", "Extreme Cold Wind Chill", "Flash Flood",
"Flood", "Freezing Fog", "Frost Freeze", "Funnel Cloud", "Hail", "Heat",
"Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane Typhoon",
"Ice Storm", "Lakeshore Flood", "Lake Effect Snow", "Lightning", "Marine Hail",
"Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current",
"Seiche", "Sleet", "Storm Surge Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather")
for( i in event_types){
index <- agrep(pattern = i, x = foo$EVTYPE, ignore.case = T)
if( length(index) == 0 ) next
foo[index, EVTYPE := i]
}
After that, I still needed to padronize some of the classification:
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "tstm wind", "Thunderstorm Wind", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "marine tstm wind", "Marine Thunderstorm Wind", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "wild forest fire", "Wildfire", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "extreme cold", "Extreme Cold Wind Chill", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "snow", "Heavy Snow", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "landslide", "Debris Flow", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "hurricane", "Hurricane Typhoon", EVTYPE)) %>%
as.data.table()
foo <- foo %>%
mutate(EVTYPE = ifelse(EVTYPE == "storm surge", "Storm Surge Tide", EVTYPE)) %>%
as.data.table()
Now I have a dataset that is closer to the 48 pre-classified groups, although there are still some labels not defined in the documentation. I choose to ignore this typos as they do not have a significant number to affect this analysis.
Now let’s pay attention to the variables CROPDMGEXP and PROPDMGEXP. As the documentation states that Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. Hence, I tranform this variables accordingly:
foo$CROPDMGEXP <- str_to_upper(foo$CROPDMGEXP)
foo$PROPDMGEXP <- str_to_upper(foo$PROPDMGEXP)
economic_impact <- foo %>%
mutate(CROPDMGEXP = case_when(CROPDMGEXP == "K" ~ 1000,
CROPDMGEXP == "M" ~ 1000000,
CROPDMGEXP == "B" ~ 1000000000,
TRUE ~ 0),
PROPDMGEXP = case_when(PROPDMGEXP == "K" ~ 1000,
PROPDMGEXP == "M" ~ 1000000,
PROPDMGEXP == "B" ~ 1000000000,
TRUE ~ 0)) %>%
select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP) %>%
as.data.table()
Let’s take a look at the evolution of the number of severe events during the 1950-2011 period in U.S.:
storm_by_year <- foo[, .N, by = .(year)] %>% arrange(year)
library(ggplot2)
## Number of severe events by year:
ggplot(data = storm_by_year) + aes(x = year, y = N) +
geom_line(size = 0.5) +
geom_point(size = 1) +
theme_minimal() +
scale_x_continuous(breaks = seq(min(storm_by_year$year), max(storm_by_year$year), by = 10)) +
labs(title = "Number of Severe Events (1950-2011)",
y = "Occurrency", x = "Year") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12))
As we can see from the previous plot, the number of registered severe events surge after 1990. This is an interesting evidence, however we can not state that the from 1990 we observe more events. It states that the number of registered servere events is more pronounced during this period.
Now let’s investigate how these events spread across the months.
storm_by_month <- foo[, .N, by = .(month)] %>% arrange(month)
storm_by_month <- storm_by_month %>%
mutate(month = factor(month, levels = month)) %>%
as.data.table()
ggplot(data = storm_by_month) + aes(x = month, y = N) +
geom_bar(stat = "identity", fill = "red") +
labs(title = "Severe Events by Month (1950-2011)",
y = "Occurrency", x = "Year") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12))
The histogram shows that the majority of the events concentrate among the April and August, which means that during these months the frequency to face this events is higher.
Once, I show the number of registered events has been growing during the 1950-2011 period and these events concentrate during a particular months, one may ask what is the most recurrent events in the country. The following table can give as the answer:
library(kableExtra)
top_10_event <- foo %>%
group_by(EVTYPE) %>%
summarise(N = n()) %>%
arrange(-N) %>%
as.data.table() %>%
head(10)
top_10_event %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
EVTYPE | N |
---|---|
Thunderstorm Wind | 329330 |
Hail | 292207 |
Flood | 82734 |
Tornado | 60699 |
Heat | 38755 |
High Wind | 21923 |
Lightning | 15777 |
Winter Storm | 11436 |
Funnel Cloud | 6938 |
Marine Thunderstorm Wind | 6175 |
The previous table shows the top 10 most recurrent severe events in U.S. As we can see, Thunderstorm Wind is the most recurrent event, followed by Hail and Flood. But, how harmful are these events for the country?
The following figure shows the most harmful events for population health. Here, I use the number of fatalities and injuries as a proxy for population health.
top_10_harmful <- foo %>%
group_by(EVTYPE) %>%
summarise(fatalities = sum(FATALITIES + INJURIES)) %>%
arrange(-fatalities) %>%
as.data.table() %>%
head(10)
top_10_harmful <- top_10_harmful %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
as.data.table()
levels(top_10_harmful$EVTYPE) <- gsub(pattern = " ", replacement = "\n", x = levels(top_10_harmful$EVTYPE))
ggplot(data = top_10_harmful) +
aes(x = as.factor(EVTYPE), y = fatalities/1000) +
geom_bar(stat = "identity", width = .8, fill = "red") +
theme_minimal() +
labs(y = "Number of Fatalities and Injuries", x = "Event Type",
title = "Top 10 most harmful events across U.S. (in thousands) ") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12))
As we can see, Tornado is the event that cause more fatalities and/or injuries in America, followed by Heat and Flood. Also, we can notice that 80% of the most harmful events are among the top 10 most recurrent events in the country. That is, these events tend to occur more frequently and cause more damages to the population health.
Now, let’s investigate the severe events that have the most economic impact across US. Here, I use the variables CROPDMG and PROPDMG as proxies for economic impact, as these variables describe Crop and property damage, respectively. Then, I multiply these column by CROPDMGEXP and PROPDMGEXP, which are the representative scale. Finally I add up by eventtypes. The following figure shows the results:
economic_impact <- economic_impact %>%
mutate(CROPDMG = CROPDMG * CROPDMGEXP,
PROPDMG = PROPDMG * PROPDMGEXP) %>%
select(EVTYPE, PROPDMG, CROPDMG) %>%
as.data.table()
economic_impact <- economic_impact[, .(economic_impact = sum(PROPDMG + CROPDMG)), by = .(EVTYPE)] %>%
arrange(-economic_impact, EVTYPE) %>% head(10)
economic_impact <- economic_impact %>%
arrange(economic_impact) %>%
mutate(economic_impact = economic_impact/1000000000,
EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
as.data.table()
ggplot(data = economic_impact) +
aes(x = EVTYPE, y = economic_impact) +
geom_bar(stat = "identity", fill = "red") +
coord_flip() +
theme_minimal() +
labs(y = "", x = "Event Type",
title = "Economic damage caused by severe events 1950-2011 (in billion U.S. dollars)
",
subtitle = "") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12))
From previous figure, we can that Flood, Hurricane/typhons and Tornado cause the most economic impact in the country during the period. Additionally, we notice that 4 of the most economic damage events are among the top 10 most recurrrent and poulation health damage.
Natural disasters, such as storms and severe weather events, are a source calamity from a economic and social perspective. In this small exercise, I analyze the United States (U.S.) National Oceanic and Atmospheric Administration’s (NOAA) storm database. The results show that the number of registered severe events increased over the year, with a surge after the 1990. Also, within years, the events tend to concentrate between April and August. Finally, I showed that almost half of the events that have the most economic and population health damages in the country are among the most recurrent events during 1950-2011 period.