An Analysis of Weather Events
By: Mohammed Teslim
15/09/22
The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database records weather events in the United States together with the time, location, degree of impact and other characteristics of each event. This analysis covers the data from 1950 through 2011. The purpose is to determine the impact of these weather events on the economy and the population health of the United States, and to identify the event types with the highest impact.
Across the country and the time range selected, floods are the event type associated with the greatest economic damage, and tornadoes are the event type most harmful to population health. Although most event types are similar in their economic and health impacts, floods and tornadoes stand out as the most impactful in economic and health terms respectively. The remainder of this writeup shows the steps and the code used to process the data and arrive at the results.
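The first step in processing the data is to load the tidyverse, which the rest of the code relies on, and to create a directory where the files of the analysis will be stored and retrieved.
library(tidyverse)  # provides stringr, dplyr, tidyr and ggplot2
if(!dir.exists("/Rreproducible-Data-Course-Project-2")){
  dir.create("/Rreproducible-Data-Course-Project-2")
}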
We have to ensure that the working directory is the same as the
directory we just created; if it is not, we set it with the
setwd() function.
if(!str_ends(getwd(), "/Rreproducible-Data-Course-Project-2")){
  setwd("/Rreproducible-Data-Course-Project-2")
}
We get the file from the internet via the link, download it into the working directory and read it into an R object. Note that the file is bzip2-compressed, so we read it through a bzfile() connection.
Url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(Url, "stormdata.bz2", mode = "wb")  # binary mode keeps the compressed file intact on Windows
storm_data <- read.csv(bzfile("stormdata.bz2"))
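To see how reading through a bzfile() connection works without downloading the full archive, here is a minimal sketch that writes a tiny CSV through a bzip2 connection and reads it back the same way. The temporary file and the two-row data frame are made up for illustration and are not part of the storm data.

```r
# Write a tiny, made-up CSV through a bzip2 connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# Read it back exactly as the report reads the storm data
round_trip <- read.csv(bzfile(tmp))
print(round_trip)
```

read.csv() opens and closes the connection itself, so no explicit decompression step is needed.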
Then we look at our data and start the preprocessing. Many values are missing, and these missing values are not represented with NA; they are simply blank, i.e. stored as empty strings. So we turn all of these empty strings into NA to give a dataframe we can work with more easily.
storm_data[storm_data ==""] <- NA
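The effect of this replacement can be checked on a small example. The sketch below uses a made-up three-row data frame (not taken from the storm data) and counts the missing values per column after the blanks are converted.

```r
# Toy data frame (hypothetical values) with blank strings standing in for missing data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "", "FLOOD"),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Logical-matrix indexing: every cell equal to "" is overwritten with NA
toy[toy == ""] <- NA

# Count the missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

An alternative worth considering is passing na.strings = "" to read.csv(), which marks blank fields as NA at read time.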
In the EVTYPE column, a character column holding the Event Type
variable, many rows contain misspelt events, e.g. flood spelt as
flooods (notice the extra “o”), and some contain abbreviations, e.g.
thunderstorm written as tstm, among others. Aside from these, many
strings include different words that point to the same event. For
example, hot, heat, warm and warmth all imply higher temperature.
Similarly, sea, ocean, coastal, marine and dam all imply water bodies,
and they are therefore grouped accordingly.
Therefore, using the stringr package present in the tidyverse, we create a function that can modify all of the strings containing these words and turn them to a single word or phrase that represents the event being described.
change_character <- function(x){
  # Each rule replaces the whole string with one category label when a keyword
  # appears anywhere in it. Rules run in order, so earlier rules take precedence.
  # Long patterns are built with paste0() so that no line break ends up inside a regex.
  x %>%
    str_replace("^(.*)(thun|tstm)(.*)$", "THUNDERSTORM") %>%
    str_replace(paste0("^(.*)(flood|floood|urban|stream|rising\\swater|",
                       "floyd|high\\swater)(.*)$"), "FLOODING") %>%
    str_replace("^(.*)(microburst)(.*)$", "MICROBURST") %>%
    str_replace("^(.*)(tornado|land\\s?spout)(.*)$", "TORNADO") %>%
    str_replace("^(.*)(light|lig)(.*)$", "LIGHTNING") %>%
    str_replace("^(.*)(fire)(.*)$", "FIRE") %>%
    str_replace("^(.*)(wi?nd|downburst)(.*)$", "WIND") %>%
    str_replace(paste0("^(.*)(snow|ice|sleet|icy|hail|",
                       "precip|frost|wintry)(.*)$"), "PRECIPITATION") %>%
    str_replace(paste0("^(.*)(cold|cool|freez|winter|low\\s*temp",
                       "|hypo|record\\slow)(.*)$"), "LOW TEMPERATURES") %>%
    str_replace(paste0("^(.*)(hot|heat|warm|warmth|high\\s*tem|",
                       "record\\s*tem|hyper|record\\shigh)(.*)$"), "HIGH TEMPERATURES") %>%
    str_replace("^(.*)(hurricane)(.*)$", "HURRICANE") %>%
    str_replace("^(.*)(blizzard)(.*)$", "BLIZZARD") %>%
    str_replace("^(.*)(rain|shower)(.*)$", "RAIN") %>%
    str_replace("^(.*)(tsunami)(.*)$", "TSUNAMI") %>%
    str_replace("^(.*)(dust)(.*)$", "DUST") %>%
    str_replace("^(.*)(volca)(.*)$", "VOLCANO") %>%
    str_replace("^(.*)(dry|dri|drought)(.*)$", "DRYNESS") %>%
    str_replace("^(.*)(wet)(.*)$", "WETNESS") %>%
    str_replace("^(.*)(tropic|typhoon)(.*)$", "TROPICAL STORM") %>%
    str_replace("^(.*)(way?ter\\s?spr?out)(.*)$", "WATERSPOUT") %>%
    str_replace("^(.*)(tide|rip)(.*)$", "CURRENT/TIDE") %>%
    str_replace("^(.*)(gust)(.*)$", "GUSTNADO") %>%
    str_replace("^(.*)(storm)(.*)$", "OTHER STORMS") %>%
    str_replace(paste0("^(.*)(surf|sea|ocean|coastal|",
                       "swell|erosi|marine|dam|drown)(.*)$"), "WATER BODIES") %>%
    str_replace("^(.*)(funnel|cloud|fog|vog)(.*)$", "CLOUDS") %>%
    str_replace("^(.*)(slide|slump|avalanch?e)(.*)$", "SLOPE EVENTS") %>%
    str_replace("^(.*)(smoke)(.*)$", "SMOKE") %>%
    # Anything still unmatched (lowercase text, or a lone non-word character) becomes OTHERS
    str_replace("^[Wa-z].*[a-z]$|^\\W$", "OTHERS") %>%
    str_trim()
}
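The order of the rules matters, which a trimmed-down sketch makes visible. The helper recode_demo() below is hypothetical and keeps only three of the rules. The sample strings are made up, although they resemble lowercased EVTYPE values.

```r
library(stringr)

# Hypothetical, trimmed-down version of the recoding idea: each rule replaces
# the WHOLE string when the keyword appears anywhere in it.
recode_demo <- function(x) {
  x <- str_replace(x, "^.*(thun|tstm).*$", "THUNDERSTORM")
  x <- str_replace(x, "^.*(flood|floood).*$", "FLOODING")
  x <- str_replace(x, "^.*(wi?nd).*$", "WIND")
  x
}

samples <- c("tstm wind", "flash floood", "high winds", "thunderstorm winds")
recoded <- recode_demo(samples)
print(recoded)
```

Because the input is lowercase and the labels are uppercase, a string that has already been recoded cannot be matched again by a later rule. This is why "tstm wind" ends up as a thunderstorm rather than as wind: the thunderstorm rule runs first.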
Other columns we shall need are PROPDMGEXP and CROPDMGEXP, character
columns that give the unit by which the preceding columns, PROPDMG
and CROPDMG, are multiplied: billions (B), millions (M), thousands (K),
or hundreds (H). We therefore create a function to replace these
letters with their numeric equivalents.
change_exp <- function(x){
  # Replace each unit letter (upper or lower case) with its multiplier, as text
  x %>%
    str_replace("B|b", "1000000000") %>%
    str_replace("M|m", "1000000") %>%
    str_replace("K|k", "1000") %>%
    str_replace("H|h", "100") %>%
    str_trim()
}
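As a worked example of the conversion, the sketch below turns four made-up damage records into dollar amounts. It uses a named lookup vector rather than the chained str_replace() calls above; this is an alternative way to express the same mapping, not the method used in this report.

```r
# Hypothetical damage records: amount and unit letter as stored in the data
dmg     <- c(25, 2.5, 1.2, 3)
dmg_exp <- c("K", "M", "B", "h")

# Named lookup vector mapping each unit letter to its multiplier
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# toupper() folds "h", "k", "m" and "b" onto the same keys before the lookup
dollars <- dmg * unname(multiplier[toupper(dmg_exp)])
print(dollars)
```

So 25 with unit K is 25,000 dollars, 2.5 with unit M is 2.5 million dollars, and so on.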
This analysis seeks to answer two questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences? For this we need two dataframes, each selecting the variables needed for one question. We then apply the functions created above.
In the code below, the needed variables are selected, the strings in the event types are converted to lowercase for uniformity, the functions are applied, and we filter out the rows that do not correspond to any event (in this case they contain “sum” or “month” and are summary rows).
Next, two tables are created. Health_Table is formed by
grouping according to EVENTS, summarizing by the sum of
each of the FATALITIES and INJURIES for each
event, tidying these two values into another variable called
HEALTH_DAMAGE, then arranging in descending order according to the
total FATALITIES + INJURIES.
For the Econ_Table, some values in PROPDMGEXP and CROPDMGEXP
contain non-alphabetic characters. We do not need these, so we filter
them out. Then the PROPDMG and CROPDMG variables are converted to
their true numeric form by multiplying them by the converted
exponents. The data is then grouped and summarized in the same way as
Health_Table above and arranged in descending order of total dollar
value.
HEALTH <- storm_data %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  mutate(EVTYPE = str_to_lower(EVTYPE)) %>%
  mutate(EVENTS = change_character(EVTYPE)) %>%
  filter(!str_detect(EVENTS, "sum") & !str_detect(EVENTS, "month"))
Health_Table <- HEALTH %>%
  group_by(EVENTS) %>%
  summarize(FATALITIES = sum(FATALITIES),
            INJURIES = sum(INJURIES),
            TOTAL = FATALITIES + INJURIES) %>%
  gather(`FATALITIES`, `INJURIES`,
         key = "HEALTH_DAMAGE", value = "COUNT") %>%
  arrange(desc(TOTAL))  # descending order of total health impact, as described above
ECONS <- storm_data %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  mutate(EVTYPE = str_to_lower(EVTYPE)) %>%
  mutate(EVENTS = change_character(EVTYPE)) %>%
  filter(!str_detect(EVENTS, "sum") & !str_detect(EVENTS, "month"))
Econ_Table <- ECONS %>%
  # Keep only rows where both unit columns hold a letter;
  # rows with NA or a symbol in either column are dropped
  filter(str_detect(PROPDMGEXP,"[:alpha:]") &
           str_detect(CROPDMGEXP,"[:alpha:]")) %>%
  mutate(PROPDMG_COST = PROPDMG * as.numeric(change_exp(PROPDMGEXP)),
         CROPDMG_COST = CROPDMG * as.numeric(change_exp(CROPDMGEXP))) %>%
  group_by(EVENTS) %>%
  summarize(PROPDMG = sum(PROPDMG_COST),
            CROPDMG = sum(CROPDMG_COST),
            TOTAL = PROPDMG + CROPDMG) %>%
  gather(`PROPDMG`, `CROPDMG`, key = "DAMAGE", value = "VALUE") %>%
  arrange(desc(TOTAL))
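The group, summarize and reshape steps are easier to follow on a small input. The sketch below runs the same kind of pipeline on three made-up records. It uses pivot_longer(), the current tidyr replacement for gather(), as an assumption about how the step could be written today; the report itself uses gather().

```r
library(dplyr)
library(tidyr)

# Hypothetical, already-recoded records (not taken from the storm data)
toy <- tibble(
  EVENTS     = c("TORNADO", "TORNADO", "FLOODING"),
  FATALITIES = c(2, 1, 4),
  INJURIES   = c(10, 5, 0)
)

toy_table <- toy %>%
  group_by(EVENTS) %>%
  summarize(FATALITIES = sum(FATALITIES),
            INJURIES   = sum(INJURIES),
            TOTAL      = FATALITIES + INJURIES) %>%
  # One row per event and damage type, with TOTAL carried along for ordering
  pivot_longer(c(FATALITIES, INJURIES),
               names_to = "HEALTH_DAMAGE", values_to = "COUNT") %>%
  arrange(desc(TOTAL))

print(toy_table)
```

Each event ends up with two rows, one per damage type, and both rows share the same TOTAL. This is what allows the plots to order the x axis by TOTAL while filling the bars by damage type.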
The data needed to answer the question about health implications is
stored in the HEALTH object, while that for the economic
implications is stored in the ECONS object.
We start with the question regarding population health. We create a
grouped bar chart that shows both the fatalities and the injuries for
each event type. Notice that the x axis has been reordered by
increasing TOTAL value. The y axis is on a logarithmic scale with
suitable breaks because the counts span several orders of magnitude.
Health_Plot <- Health_Table %>%
  ggplot(aes(x = reorder(EVENTS, TOTAL), y = COUNT, fill = HEALTH_DAMAGE)) +
  geom_col(colour = "black", position = position_dodge(0.7), width = 0.5) +
  scale_y_log10(breaks = c(10,100,1000,10000,75000)) +
  scale_fill_brewer(labels = c("FATALITIES", "INJURIES"),
                    palette = "Set1") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1,
                                   size = rel(0.5))) +
  theme(legend.background = element_rect(fill = "white", colour = "black")) +
  xlab("Event Type") + ylab("Count") +
  labs(fill = "Type of Health Damage")
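One consequence of the logarithmic y axis is that an event type with a count of zero has no finite position on it. The short sketch below illustrates this with three values taken from the TOTAL column of the ranking table at the end of this report. As far as we know, ggplot2 drops such non-finite values with a warning, so those event types appear without bars.

```r
# Three totals from the ranking table at the end of this report
counts <- c(TORNADO = 97043, DRYNESS = 4, SMOKE = 0)

# The log10 transform sends zero to -Inf, which cannot be placed on the axis
log_counts <- log10(counts)
print(log_counts)
print(is.finite(log_counts))
```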
Next comes the question regarding economic implications. A similar grouped bar chart is created for this purpose. The y axis is again on a logarithmic scale, with breaks chosen so that the values fit the plot area.
Econ_Plot <- Econ_Table %>%
  ggplot(aes(x = reorder(EVENTS, TOTAL), y = VALUE, fill = DAMAGE)) +
  geom_col(colour = "black", position = position_dodge(0.7), width = 0.5) +
  scale_fill_brewer(labels = c("Crop Damage", "Property Damage"),
                    palette = "Accent") +
  # trans_format() and math_format() come from the scales package
  scale_y_log10(breaks = c(10^3,10^4,10^5,10^8,10^11),
                labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1,
                                   size = rel(0.5))) +
  theme(legend.background = element_rect(fill = "white", colour = "black")) +
  xlab("Event Type") + ylab("Value (Log of Dollar Amount)") +
  labs(fill = "Type of Economic Damage")
The plot below is a graphical representation of the event types and their health impact on the population. Tornadoes are the event type with the highest fatalities and the highest injuries, and are therefore the most harmful to the population health of the United States. They are followed, at a considerable distance, by events that result in High Temperatures and then Thunderstorms. Gustnado, Smoke, Volcano and Wetness events have the smallest impact on the population health of the United States.
print(Health_Plot)
Plot showing event type with associated fatalities and injuries
The plot below shows the association of event types with damage to crops and property, measured in dollars. Floods have the greatest impact on both crop and property damage, followed closely by Hurricane and then Precipitation. Volcanoes contribute least to crop and property damage.
print(Econ_Plot)
Plot showing event type with associated crop and property damage
This table ranks the events in descending order of health impact, measured as total Fatalities and Injuries. Once again, Tornado takes the lead.
knitr::kable(Health_Table %>% group_by(EVENTS) %>% summarize(TOTAL = unique(TOTAL)) %>% arrange(desc(TOTAL)))
| EVENTS | TOTAL |
|---|---|
| TORNADO | 97043 |
| HIGH TEMPERATURES | 12422 |
| THUNDERSTORM | 10301 |
| FLOODING | 10240 |
| LIGHTNING | 6051 |
| PRECIPITATION | 5037 |
| LOW TEMPERATURES | 2705 |
| WIND | 2691 |
| FIRE | 1698 |
| HURRICANE | 1461 |
| CLOUDS | 1159 |
| CURRENT/TIDE | 1122 |
| BLIZZARD | 906 |
| DUST | 507 |
| SLOPE EVENTS | 494 |
| TROPICAL STORM | 454 |
| WATER BODIES | 449 |
| RAIN | 380 |
| OTHERS | 267 |
| TSUNAMI | 162 |
| OTHER STORMS | 57 |
| WATERSPOUT | 32 |
| MICROBURST | 31 |
| DRYNESS | 4 |
| GUSTNADO | 0 |
| SMOKE | 0 |
| VOLCANO | 0 |
| WETNESS | 0 |