Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
library(tidyverse)
library(lubridate)
library(pdftools)
library(tm)
library(stringdist)
# NOAA <- read_csv("./repdata_data_StormData.csv.bz2", cols(CROPDMGEXP = col_character()))
NOAA <- read_csv("repdata_data_StormData.csv",
col_types = cols(BGN_TIME = col_time(format = "%H%M"),
CROPDMGEXP = col_character()))
We want to determine the most costly disasters, so we have to examine the property damage (PROPDMG) and crop damage (CROPDMG) values. These are numeric values that must be scaled by a magnitude code given in another field (PROPDMGEXP and CROPDMGEXP, respectively). Unfortunately, the codes are messy (a mix of letters such as K, M, and B, digits, and stray symbols), so they must be converted to multipliers before the actual dollar values can be calculated.
raw <- as_tibble(NOAA)  # as.tibble() is deprecated in favour of as_tibble()
head(raw)
## # A tibble: 6 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <time> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 4/18/19~ 01:30 CST 97 MOBILE AL TORNA~ 0
## 2 1 4/18/19~ 01:45 CST 3 BALDWIN AL TORNA~ 0
## 3 1 2/20/19~ 16:00 CST 57 FAYETTE AL TORNA~ 0
## 4 1 6/8/195~ 09:00 CST 89 MADISON AL TORNA~ 0
## 5 1 11/15/1~ 15:00 CST 43 CULLMAN AL TORNA~ 0
## 6 1 11/15/1~ 20:00 CST 77 LAUDERDALE AL TORNA~ 0
## # ... with 28 more variables: BGN_AZI <lgl>, BGN_LOCATI <lgl>, END_DATE <lgl>,
## # END_TIME <lgl>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <lgl>, END_LOCATI <lgl>, LENGTH <dbl>, WIDTH <dbl>, F <dbl>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <lgl>,
## # STATEOFFIC <lgl>, ZONENAMES <lgl>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <lgl>, REFNUM <dbl>
raw$PROPDMGEXP <- as.factor(raw$PROPDMGEXP)
levels(raw$PROPDMGEXP)
## [1] "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
raw$CROPDMGEXP <- as.factor(raw$CROPDMGEXP)
levels(raw$CROPDMGEXP)
## [1] "?" "0" "2" "B" "k" "K" "m" "M"
raw <- raw %>% mutate(PROPDMG = ifelse(PROPDMGEXP == "K", PROPDMG*1000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "M", PROPDMG*1000000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "B", PROPDMG*1000000000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "H", PROPDMG*100, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "h", PROPDMG*100, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "m", PROPDMG*1000000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "8", PROPDMG*100000000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "7", PROPDMG*10000000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "6", PROPDMG*1000000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "5", PROPDMG*100000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "4", PROPDMG*10000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "3", PROPDMG*1000, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "2", PROPDMG*100, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "1", PROPDMG*10, PROPDMG),
PROPDMG = ifelse(PROPDMGEXP == "0", PROPDMG*1, PROPDMG)
) %>%
mutate(CROPDMG = ifelse(CROPDMGEXP == "B", CROPDMG*1000000000, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "M", CROPDMG*1000000, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "m", CROPDMG*1000000, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "K", CROPDMG*1000, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "k", CROPDMG*1000, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "2", CROPDMG*100, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "0", CROPDMG*1, CROPDMG),
CROPDMG = ifelse(CROPDMGEXP == "?", CROPDMG*0, CROPDMG)
)
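The conversion above can also be sketched as a single lookup, which keeps the rules in one place. Note that ‘ifelse’ returns NA wherever its test is NA, so in the chain above rows with a missing exponent end up with an NA damage value. The helper name `exp_multiplier` and the toy data are hypothetical, and the choice to leave values unchanged for unrecognised codes ("-", "?", "+", NA) is an assumption; it differs from the crop rule above, which sets "?" to zero.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
exp_multiplier <- function(code) {
  lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  code <- tolower(as.character(code))
  mult <- unname(lookup[code])               # letter codes; NA where nothing matches
  is_digit <- !is.na(code) & grepl("^[0-9]$", code)
  mult[is_digit] <- 10^as.numeric(code[is_digit])  # digit d means 10^d
  mult[is.na(mult)] <- 1                     # assumption: unknown codes leave the value as is
  mult
}

# Made-up rows, for illustration only.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 7),
                  PROPDMGEXP = c("K", "M", "5", NA))
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

The same function could be applied to CROPDMGEXP, with any special cases handled explicitly afterwards.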
Next, a complete listing of the storm event types is needed. The Storm Data Documentation contains the complete list on page 6, and the event types are read in from that file.
The required libraries were loaded above. The pdftools package reads in the .pdf file, the tm package supplies ‘stripWhitespace’ to collapse runs of whitespace, and the text is then mined with regular expressions using the stringr package (loaded as part of the tidyverse).
events <- pdf_text("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf")
events <- events[6]
events <- stripWhitespace(events)
events <- str_extract(events, pattern = "Astronomical.*Winter Weather")
events <- str_split(events, pattern = "\\s[A-Z]\\s")
events <- unlist(events)
events <- tolower(events)
The vector returned by ‘pdf_text’ is a character vector, with each page from the original pdf being one element. Therefore the vector has 97 elements.
Page 6 is extracted and ‘stripWhitespace’ collapses its runs of whitespace into single spaces. In the code shown, this is applied directly to the character vector, so the ‘VectorSource’ and ‘Corpus’ functions are not called.
Then, using regular expressions, the complete listing of event types is created for use later: the text from “Astronomical” through “Winter Weather” is extracted and split at the isolated capital letters that separate the event names. The final list is set to lower case letters to allow for easier matching with the storm data.
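The extraction and splitting steps can be tried on a small string. This is a minimal sketch: the `page` text is a made-up stand-in for the whitespace-collapsed page, not the real contents of the documentation.

```r
library(stringr)

# Made-up stand-in for one page of whitespace-collapsed PDF text.
page <- "Event Name Designator Astronomical Low Tide Z Avalanche Z Blizzard Z Winter Weather Z Notes"

# Keep only the span from the first event name to the last one.
listing <- str_extract(page, "Astronomical.*Winter Weather")

# Split wherever a lone capital letter sits between two whitespace characters.
event_names <- tolower(unlist(str_split(listing, "\\s[A-Z]\\s")))
print(event_names)
```

Because the split pattern requires whitespace on both sides of the capital letter, capitals that begin a word (such as the "L" in "Low") are not treated as separators.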
data <- raw %>%
mutate(EVTYPE = str_to_lower(EVTYPE),
EVTYPE = str_trim(EVTYPE, side = "left"),
EVTYPE = str_replace_all(EVTYPE, pattern = "tstm", replacement = "thunderstorm"),
EVTYPE = str_replace_all(EVTYPE, pattern = "\\s\\(g\\d*\\)", replacement = ""),
EVTYPE = str_replace_all(EVTYPE, pattern = "^heat", replacement = "excessive heat"),
EVTYPE = str_replace_all(EVTYPE, pattern = "wild/forest fire", replacement = "wildfire"),
EVTYPE = str_replace_all(EVTYPE, pattern = "strong wind", replacement = "high wind"),
EVTYPE = str_replace_all(EVTYPE, pattern = "winter weather", replacement = "winter storm"),
EVTYPE = str_replace_all(EVTYPE, pattern = "^hurricane$", replacement = "hurricane (typhoon)"))
The “EVTYPE” variable contains numerous different spellings and classifications of the event types. Since there are only 48 official event types, the data in this variable must be cleaned to allow for proper classification. In the code above, the text is set to lower case and leading spaces are removed, as in the events list, and several string replacements are made to frequently occurring entries that have a significant impact on the final analysis.
data <- data %>%
mutate(Eventcode = amatch(EVTYPE, events, maxDist = 5)) %>%
mutate(Event = ifelse(is.na(Eventcode), EVTYPE, events[Eventcode]))
The “EVTYPE” variable is matched to the events list using the ‘amatch’ function, and the index of the matching entry is placed in a new variable, ‘Eventcode’. Finally, if ‘Eventcode’ is NA, the observation in the ‘EVTYPE’ variable is used; otherwise, the matching entry in the events list is used. The result is stored in ‘Event’.
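To see how the approximate matching behaves, here is a small sketch on made-up strings; the `official` and `observed` vectors are illustrative and are not taken from the storm data. It also shows that a threshold of 5 edits is fairly permissive for short names.

```r
library(stringdist)

# Illustrative vectors, not the real event list.
official <- c("thunderstorm wind", "tornado", "flash flood", "hail")
observed <- c("thunderstorm winds", "tornadoes", "flash flooding",
              "volcanic ashfall", "rain")

# Index of the closest official name within 5 edits, or NA if none is that close.
idx <- amatch(observed, official, maxDist = 5)

# Fall back to the observed string when nothing matched.
resolved <- ifelse(is.na(idx), observed, official[idx])
print(data.frame(observed, idx, resolved))
```

In this toy example "rain" is only two edits away from "hail" and is therefore matched to it, so it may be worth spot-checking the matches, or trying a smaller `maxDist`, before relying on the totals.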
The data set is first tidied by gathering the names ‘FATALITIES’ and ‘INJURIES’ into a single column called ‘Casualty’ and their values into another column called ‘Number’. Any observation with ‘Number’ equal to zero is removed, since there were no casualties. The data are grouped by ‘Event’ and ‘Number’ is summed. The top 10 events are selected, the first letters are capitalized, and only the ‘Event’ column is kept.
The data are then processed again to provide output for charting. They are tidied as above and the ‘Event’ column is changed to title case. The data set is then reduced to only those events appearing in the top 10 list created above, and the ‘Event’ column is changed to a factor variable, with the levels set to the order of the top 10 list. This puts the bars in order in the chart.
top10_casualty <- data %>%
gather(FATALITIES, INJURIES, key = Casualty, value = Number) %>%
filter(Number > 0) %>%
group_by(Event) %>%
summarize(Total = sum(Number)) %>%
arrange(Total) %>%
top_n(10, Total) %>%
mutate(Event = str_to_title(Event)) %>%
select(1)
Casualty <- data %>%
gather(FATALITIES, INJURIES, key = Casualty, value = Number) %>%
mutate(Event = str_to_title(Event)) %>%
filter(Number > 0, Event %in% top10_casualty$Event) %>%
mutate(Event = factor(Event, top10_casualty$Event))
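The reshape-and-rank pattern can be illustrated on a toy table. This sketch swaps in ‘pivot_longer’ and ‘slice_max’, the current tidyr and dplyr replacements for ‘gather’ and ‘top_n’; the data and the name `top_events` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up casualty counts, for illustration only.
toy <- tibble(
  Event      = c("tornado", "tornado", "flood", "hail", "flood"),
  FATALITIES = c(2, 0, 1, 0, 0),
  INJURIES   = c(10, 3, 0, 0, 4)
)

top_events <- toy %>%
  # One row per (event record, casualty type) pair.
  pivot_longer(c(FATALITIES, INJURIES),
               names_to = "Casualty", values_to = "Number") %>%
  filter(Number > 0) %>%
  group_by(Event) %>%
  summarize(Total = sum(Number)) %>%
  slice_max(Total, n = 2) %>%   # keep the two largest totals
  arrange(Total)                # smallest first, as in the code above
print(top_events)
```

Here "hail" drops out because all of its counts are zero, and the remaining events are returned in ascending order of their totals.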
The steps for determining the financial impact of weather events are similar to those above. The top 10 most costly events are identified and ordered, and the data are then output for charting. The code shown applies no inflation adjustment, so the damage figures are used as recorded.
top10_econ <- data %>%
gather(CROPDMG, PROPDMG, key = Damage, value = Number) %>%
filter(Number > 0) %>%
group_by(Event) %>%
summarize(Total = sum(Number)) %>%
arrange(Total) %>%
top_n(10, Total) %>%
mutate(Event = str_to_title(Event)) %>%
select(1)
Damage <- data %>%
gather(CROPDMG, PROPDMG, key = Damage, value = Number) %>%
mutate(Event = str_to_title(Event)) %>%
filter(Number > 0, Event %in% top10_econ$Event) %>%
mutate(Event = factor(Event, top10_econ$Event))
ggplot(Damage) +
aes(x = Event, fill = Damage, weight = Number) +
geom_bar() +
scale_fill_hue() +
labs(x = "Events", y = "Damage (USD)", title = "Top 10 Events Harming the US Economy") +
coord_flip() +
theme_minimal()
ggplot(Casualty) +
aes(x = Event, fill = Casualty, weight = Number) +
geom_bar() +
scale_fill_hue() +
labs(x = "Events", y = "Number of casualties", title = "Top 10 Weather Events Causing Casualties") +
coord_flip() +
theme_minimal()
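The bar order in both charts comes from the factor levels set earlier rather than from the order of the rows. A minimal base R sketch of that idea, using a made-up ranking:

```r
# Made-up event labels and ranking (smallest total first), for illustration only.
events_seen <- c("Tornado", "Flood", "Tornado", "Hail")
ranking     <- c("Hail", "Flood", "Tornado")

# Levels follow the ranking, not alphabetical order or order of appearance.
f <- factor(events_seen, levels = ranking)
print(levels(f))
print(as.integer(f))  # position of each observation within the ranking
```

ggplot2 places discrete categories in level order, so with ‘coord_flip’ the first level is drawn at the bottom and the last level, here the largest total, at the top.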
As the charts show, floods, hurricanes, and tornadoes are the three most harmful event types for the American economy. With respect to public health, tornadoes, excessive heat, and thunderstorms cause the most casualties.