
Description

The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database records weather events in the United States together with the time, location, degree of impact and other characteristics of each event. This analysis covers the data from 1950 through 2011. Its purpose is to assess the impact of these weather events on the economy and on population health in the United States, and to identify the event types with the highest impact.

Across the country and over the time range selected, floods are the event type associated with the greatest economic damage, and tornadoes are the event type most harmful to population health. Although most event types are similar in their economic and health impacts, floods and tornadoes stand out as the most impactful in economic and health terms respectively. The remainder of this writeup shows the steps and the code used to process the data and arrive at the results.

Data Processing

The first step in processing the data is to create a directory where the files for the analysis will be stored and retrieved.

if(!dir.exists("/Rreproducible-Data-Course-Project-2")){
    dir.create("/Rreproducible-Data-Course-Project-2")
}

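We have to ensure that the working directory is the directory we just created; if it is not, we set it with the setwd() function. The check uses str_ends() from stringr, so the tidyverse is loaded first.

library(tidyverse)  # provides stringr, dplyr, tidyr and ggplot2, all used below

if(!str_ends(getwd(), "/Rreproducible-Data-Course-Project-2")){
    setwd("/Rreproducible-Data-Course-Project-2")
}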
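The same create-then-switch logic can be tried without touching the real file system layout. The sketch below is only an illustration under stated assumptions: it places the folder inside tempdir() (an assumption, the analysis itself uses the path above), and it compares the last path component with basename() from base R instead of str_ends(). The names proj_dir, old_wd and in_project are hypothetical.

```r
# Hypothetical location: a project folder inside R's temporary directory
proj_dir <- file.path(tempdir(), "Rreproducible-Data-Course-Project-2")

# Create the folder only if it does not exist yet
if (!dir.exists(proj_dir)) {
    dir.create(proj_dir)
}

# Remember where we were, then switch only if we are not already there
old_wd <- getwd()
if (basename(getwd()) != basename(proj_dir)) {
    setwd(proj_dir)
}
in_project <- basename(getwd()) == basename(proj_dir)

# Restore the previous working directory when done
setwd(old_wd)
```

Building the path with file.path() avoids hard-coding the separator, which may matter if the report is knitted on a different operating system.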

We download the file from the internet via the link into the working directory and read it into an R object. Note that the file is bzip2-compressed, so we read it through a bzfile() connection.

Url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(Url, "stormdata.bz2")
storm_data <- read.csv(bzfile("stormdata.bz2"))
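As a quick check that bzfile() handles the compression transparently, the sketch below writes a tiny made-up table to a temporary .bz2 file and reads it back. The toy data frame and the temporary file are assumptions for illustration only and have nothing to do with the actual storm data.

```r
# Toy data frame (hypothetical values, not taken from the NOAA file)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 1))

# Write it through a bzip2-compressing connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses on the fly when given a bzfile() connection
toy_back <- read.csv(bzfile(tmp))
```

The data frame that comes back has the same rows and columns as the one written, which is the behaviour the read.csv(bzfile(...)) call above relies on.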

Preprocessing

Next we inspect the data and begin preprocessing. Many values are missing, and these missing values are not represented with NA; they are simply absent, i.e. stored as empty strings. So we turn all of these empty strings into NA to give a dataframe we can work with more easily.

storm_data[storm_data ==""] <- NA
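To see what this assignment does, here is a minimal sketch on a made-up three-row data frame (the values are hypothetical). Comparing a data frame with "" yields a logical matrix, and assigning through that matrix replaces every matching cell.

```r
# Toy data frame with empty strings standing in for missing values
toy <- data.frame(EVTYPE = c("flood", "", "hail"),
                  PROPDMGEXP = c("K", "", ""),
                  stringsAsFactors = FALSE)

# Replace every empty string, in any column, with NA
toy[toy == ""] <- NA

# Count the missing values per column
na_counts <- colSums(is.na(toy))
```

A possible alternative, not used in this analysis, is to pass na.strings = "" to read.csv() so that the empty fields are read as NA in the first place.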

In the EVTYPE column, a character column holding the Event Type variable, many rows contain misspelt events, e.g. floods spelt as flooods (notice the extra “o”), and some use abbreviations, e.g. thunderstorm written as tstm. Aside from these, many strings contain different words that point to the same event; for example, hot, heat, warm and warmth all imply higher temperature. Similarly, sea, ocean, coastal, marine and dam all imply water bodies, so they are grouped accordingly.

Therefore, using the stringr package present in the tidyverse, we create a function that can modify all of the strings containing these words and turn them to a single word or phrase that represents the event being described.

change_character <- function(x){
    # Each rule rewrites any string containing a keyword as one uppercase label.
    # Long patterns are assembled with paste0() so that no line break or
    # indentation ends up inside the regular expression.
    x %>% 
        str_replace("^(.*)(thun|tstm)(.*)$", "THUNDERSTORM") %>% 
        str_replace(paste0("^(.*)(flood|floood|urban|stream|rising\\swater|",
                           "floyd|high\\swater)(.*)$"), "FLOODING") %>% 
        str_replace("^(.*)(microburst)(.*)$", "MICROBURST") %>% 
        str_replace("^(.*)(tornado|land\\s?spout)(.*)$", "TORNADO") %>% 
        str_replace("^(.*)(light|lig)(.*)$", "LIGHTNING") %>% 
        str_replace("^(.*)(fire)(.*)$", "FIRE") %>% 
        str_replace("^(.*)(wi?nd|downburst)(.*)$", "WIND") %>% 
        str_replace(paste0("^(.*)(snow|ice|sleet|icy|hail|",
                           "precip|frost|wintry)(.*)$"), "PRECIPITATION") %>% 
        str_replace(paste0("^(.*)(cold|cool|freez|winter|low\\s*temp",
                           "|hypo|record\\slow)(.*)$"), "LOW TEMPERATURES") %>% 
        str_replace(paste0("^(.*)(hot|heat|warm|warmth|high\\s*tem|",
                           "record\\s*tem|hyper|record\\shigh)(.*)$"), "HIGH TEMPERATURES") %>% 
        str_replace("^(.*)(hurricane)(.*)$", "HURRICANE") %>% 
        str_replace("^(.*)(blizzard)(.*)$", "BLIZZARD") %>% 
        str_replace("^(.*)(rain|shower)(.*)$", "RAIN") %>%
        str_replace("^(.*)(tsunami)(.*)$", "TSUNAMI") %>%
        str_replace("^(.*)(dust)(.*)$", "DUST") %>%
        str_replace("^(.*)(volca)(.*)$", "VOLCANO") %>%
        str_replace("^(.*)(dry|dri|drought)(.*)$", "DRYNESS") %>%
        str_replace("^(.*)(wet)(.*)$", "WETNESS") %>%
        str_replace("^(.*)(tropic|typhoon)(.*)$", "TROPICAL STORM") %>%
        str_replace("^(.*)(way?ter\\s?spr?out)(.*)$", "WATERSPOUT") %>%
        str_replace("^(.*)(tide|rip)(.*)$", "CURRENT/TIDE") %>%
        str_replace("^(.*)(gust)(.*)$", "GUSTNADO") %>%
        str_replace("^(.*)(storm)(.*)$", "OTHER STORMS") %>%
        str_replace(paste0("^(.*)(surf|sea|ocean|coastal|",
                           "swell|erosi|marine|dam|drown)(.*)$"), "WATER BODIES") %>%
        str_replace("^(.*)(funnel|cloud|fog|vog)(.*)$", "CLOUDS") %>%
        str_replace("^(.*)(slide|slump|avalanch?e)(.*)$", "SLOPE EVENTS") %>%
        str_replace("^(.*)(smoke)(.*)$", "SMOKE") %>%
        str_replace("^[Wa-z].*[a-z]$|^\\W$", "OTHERS") %>%
        str_trim()
}
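The order of the rules matters, which a shortened version of the function makes easy to see. The sketch below keeps only three of the patterns; the name recode_event and the sample strings are hypothetical. Because each replacement is uppercase and the patterns are lowercase, a string that has already been relabelled is not touched by later rules.

```r
library(stringr)
library(magrittr)  # loaded explicitly for the %>% pipe

# Shortened, hypothetical version of the recoding function (three rules only)
recode_event <- function(x) {
    x %>%
        str_replace("^(.*)(thun|tstm)(.*)$", "THUNDERSTORM") %>%
        str_replace("^(.*)(flood|floood)(.*)$", "FLOODING") %>%
        str_replace("^(.*)(wi?nd)(.*)$", "WIND")
}

# Lowercased sample strings in the style of the EVTYPE column
raw <- c("tstm wind", "flash floood", "high winds", "thunderstorm winds")
recoded <- recode_event(raw)
# recoded is: "THUNDERSTORM" "FLOODING" "WIND" "THUNDERSTORM"
```

Here "tstm wind" ends up as THUNDERSTORM rather than WIND because the thunderstorm rule runs first; swapping the two rules would change the grouping.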

Other columns we shall need are PROPDMGEXP and CROPDMGEXP, character columns that give the unit by which the preceding columns, PROPDMG and CROPDMG, are multiplied: billions (B), millions (M), thousands (K) or hundreds (H). We therefore create a function to replace these letters with their numeric equivalents.

change_exp <- function(x){
    x %>% 
        str_replace("B|b", "1000000000") %>% 
        str_replace("M|m", "1000000") %>%
        str_replace("K|k", "1000") %>% 
        str_replace("H|h", "100") %>% 
        str_trim()
}
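The conversion from unit letter to dollar amount is plain arithmetic: 2.5 with unit K is 2.5 × 1000 = 2500. As an alternative sketch, not the method used in this analysis, the same mapping can be written as a named lookup vector in base R. The names multipliers, dmg, code and cost and all the values are hypothetical.

```r
# Lookup table from unit letter to multiplier
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up damage figures and unit codes, including one invalid code
dmg  <- c(2.5, 10, 1, 4)
code <- c("K", "m", "B", "?")

# toupper() handles lowercase codes; unknown codes index to NA
cost <- dmg * unname(multipliers[toupper(code)])
# cost is: 2500, 1e+07, 1e+09, NA
```

With a lookup, a code that is not one of the four letters produces NA instead of being passed to as.numeric() as text, so invalid codes stay visible in the result.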

Transformation

This analysis seeks to answer two questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences? For this we need two dataframes, each selecting the variables needed for one question. We then apply the functions created.

In the code below, the needed variables are selected, the strings in the event types are converted to lowercase for uniformity, the functions are applied and we filter out the rows that do not correspond to any event (in this case they start with “sum” and “month” and are basically summary rows).

Next, two tables are created. Health_Table is formed by grouping according to EVENTS, summing FATALITIES and INJURIES for each event, and gathering these two values into a variable called HEALTH_DAMAGE. The TOTAL column holds FATALITIES + INJURIES for each event and is used later to order the events.

For Econ_Table, some values in PROPDMGEXP and CROPDMGEXP contain non-alphabetic characters; we do not need these, so we filter them out. The PROPDMG and CROPDMG variables are then converted to their true numeric values by multiplying them by the converted multipliers. The data is then grouped and summarized in the same way as Health_Table above and arranged in descending order of total dollar value.

HEALTH <- storm_data %>% 
            select(EVTYPE, FATALITIES, INJURIES) %>% 
            mutate(EVTYPE = str_to_lower(EVTYPE)) %>% 
            mutate(EVENTS = change_character(EVTYPE)) %>% 
            filter(!str_detect(EVENTS, "sum") & !str_detect(EVENTS, "month"))

Health_Table <- HEALTH %>% 
                    group_by(EVENTS) %>% 
                    summarize(FATALITIES = sum(FATALITIES),
                              INJURIES = sum(INJURIES),
                              TOTAL = FATALITIES + INJURIES) %>%
                    gather(`FATALITIES`, `INJURIES`, 
                           key = "HEALTH_DAMAGE", value = "COUNT")


ECONS <- storm_data %>% 
            select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% 
            mutate(EVTYPE = str_to_lower(EVTYPE)) %>% 
            mutate(EVENTS = change_character(EVTYPE)) %>% 
            filter(!str_detect(EVENTS, "sum") & !str_detect(EVENTS, "month"))

Econ_Table <- ECONS %>% 
    filter(str_detect(PROPDMGEXP,"[:alpha:]") &
               str_detect(CROPDMGEXP,"[:alpha:]")) %>%
    mutate(PROPDMG_COST = PROPDMG * as.numeric(change_exp(PROPDMGEXP)), 
           CROPDMG_COST = CROPDMG * as.numeric(change_exp(CROPDMGEXP))) %>%
    group_by(EVENTS) %>% 
    summarize(PROPDMG = sum(PROPDMG_COST),
              CROPDMG = sum(CROPDMG_COST),
              TOTAL = PROPDMG + CROPDMG) %>% 
    gather(`PROPDMG`, `CROPDMG`, key = "DAMAGE", value = "VALUE") %>% 
    arrange(desc(TOTAL))

The data needed to answer the question about health implications is stored in the HEALTH object, while that for the economic implications is stored in the ECONS object. Health_Table and Econ_Table hold the corresponding summaries.
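The grouping and reshaping steps can be followed on a small example. The sketch below applies the same group_by(), summarize() and gather() sequence to a made-up three-row data frame; the values are hypothetical, and toy and toy_table are illustrative names.

```r
library(dplyr)
library(tidyr)

# Made-up events: two tornado records and one flooding record
toy <- data.frame(EVENTS = c("TORNADO", "TORNADO", "FLOODING"),
                  FATALITIES = c(2, 1, 4),
                  INJURIES = c(10, 5, 0))

toy_table <- toy %>%
    group_by(EVENTS) %>%
    # One row per event type, with the per-event sums and their total
    summarize(FATALITIES = sum(FATALITIES),
              INJURIES = sum(INJURIES),
              TOTAL = FATALITIES + INJURIES) %>%
    # Stack the two count columns into a key/value pair (long format)
    gather(FATALITIES, INJURIES, key = "HEALTH_DAMAGE", value = "COUNT") %>%
    arrange(desc(TOTAL))
```

The result has two rows per event type, one for FATALITIES and one for INJURIES, each carrying the event's TOTAL. That long layout is what lets ggplot2 map the damage type to the fill colour of a grouped bar chart.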

Exploratory Analysis

We start with the question regarding population health. We create a grouped bar chart that shows both the fatalities and the injuries for each event type. Notice that the x axis has been reordered by increasing TOTAL value. A logarithmic y axis with suitable breaks has been chosen because of the distribution of the values.

Health_Plot <- Health_Table %>%
    ggplot(aes(x = reorder(EVENTS, TOTAL), y = COUNT, fill = HEALTH_DAMAGE)) + 
    geom_col(colour = "black", position = position_dodge(0.7), width = 0.5) +
    scale_y_log10(breaks = c(10,100,1000,10000,75000)) +
    scale_fill_brewer(labels = c("FATALITIES", "INJURIES"),
                      palette = "Set1") +
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1,
                                     size = rel(0.5))) +
    theme(legend.background = element_rect(fill = "white", colour = "black")) +
    xlab("Event Type") + ylab("Count") + 
    labs(fill = "Type of Health Damage")
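One consequence of the logarithmic axis is worth checking with a little arithmetic. The sketch below takes three totals from the results table of this report (TORNADO, DRYNESS and SMOKE) and applies log10() directly; the vector name counts is hypothetical.

```r
# Totals taken from the results table in this report
counts <- c(TORNADO = 97043, DRYNESS = 4, SMOKE = 0)

# The transformation a log10 axis applies to each value
log_counts <- log10(counts)

# A zero count has no finite position on a log axis
zero_is_infinite <- is.infinite(log_counts[["SMOKE"]])
```

The log scale compresses the gap between 4 and 97043 to a little over four units, which is why it suits this distribution, but an event type with a count of zero maps to -Inf and so cannot be placed at a finite height on the chart.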

Next comes the question regarding economic implications. A similar grouped bar chart has been created for this purpose, again with a logarithmic y axis and appropriate breaks so that the values fit the plot area.

library(scales)  # provides trans_format() and math_format() for the axis labels

Econ_Plot <- Econ_Table %>%
            ggplot(aes(x = reorder(EVENTS, TOTAL), y = VALUE, fill = DAMAGE)) + 
            geom_col(colour = "black", position = position_dodge(0.7), width = 0.5) +
            scale_fill_brewer(labels = c("Crop Damage", "Property Damage"),
                      palette = "Accent") +
            scale_y_log10(breaks = c(10^3,10^4,10^5,10^8,10^11),
                  labels = trans_format("log10", math_format(10^.x))) +
            theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1,
                                     size = rel(0.5))) +
            theme(legend.background = element_rect(fill = "white", colour = "black")) +
            xlab("Event Type") + ylab("Value (Log of Dollar Amount)") + 
            labs(fill = "Type of Economic Damage")

Results

Plots

The plot below is a graphical representation of the event types and their health impact on the population. Tornadoes are the event type with the highest fatalities and the highest injuries, and are therefore the most harmful to population health in the United States. They are followed by events that result in High Temperatures and then by Thunderstorms. Gustnado, Smoke, Volcano and Wetness events have the smallest impact on population health.

print(Health_Plot)

Plot showing event type with associated fatalities and injuries

The plot below shows the association of event types with damage to crops and property, measured in dollars. Floods have the highest impact on both crop and property damage, followed closely by Hurricane and then Precipitation. Volcanoes contribute least to the damage of crops and property.

print(Econ_Plot)

Plot showing event type with associated crop and property damage

Table

This table ranks the events in descending order of health impact, measured as total Fatalities and Injuries. It can be seen again that Tornado takes the lead.

knitr::kable(Health_Table %>% group_by(EVENTS) %>% summarize(TOTAL = unique(TOTAL)) %>% arrange(desc(TOTAL)))

EVENTS	TOTAL
TORNADO	97043
HIGH TEMPERATURES	12422
THUNDERSTORM	10301
FLOODING	10240
LIGHTNING	6051
PRECIPITATION	5037
LOW TEMPERATURES	2705
WIND	2691
FIRE	1698
HURRICANE	1461
CLOUDS	1159
CURRENT/TIDE	1122
BLIZZARD	906
DUST	507
SLOPE EVENTS	494
TROPICAL STORM	454
WATER BODIES	449
RAIN	380
OTHERS	267
TSUNAMI	162
OTHER STORMS	57
WATERSPOUT	32
MICROBURST	31
DRYNESS	4
GUSTNADO	0
SMOKE	0
VOLCANO	0
WETNESS	0

An Analysis of Weather Events