Synopsis

The goal of the analysis was to aswer these two questions:

The analysis is based on data gathered in the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.

As the indicator of the first question total damage in property and crops in millions of USD was chosen. As the indicator of the latter, total number of fatalities and injuries was chosen.

Data Processing

Preparation

First, I set global options and load necessary libraries:

library(knitr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(maps)

opts_chunk$set(echo = TRUE, message=FALSE, warning = FALSE, cache=TRUE)

After that I check if the necessary raw data are in the working directory. If not, I download them.

## check if the file already exists in the working directory

if(!file.exists("StormData.csv.bz2")){
    
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

    zipped <- download.file(url,"StormData.csv.bz2")
}

In the next step I load raw data from the csv file. The read.csv() function takes care of unpacking the data.

stormData <- read.csv("StormData.csv.bz2", header = TRUE)

Data cleaning and processing

The raw data set is not a tidy data set. Some variables are not necessary to answer the research questions, so I select only a few of them:

reduce data set

selectedData <- select(stormData, one_of(c("EVTYPE", "FATALITIES", "INJURIES","PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "STATE")))

Then I perform a set of data transformations.

First, I correct this mistake, where milions were coded as bilions:

### correcting the 115 B

wrongNumber <- selectedData[605953,"PROPDMGEXP"]
print(wrongNumber)
## [1] B
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
selectedData[605953,"PROPDMGEXP"] <- "M"

As an indicator of economic impact I choose total amount of property damage in millions of US dollars. The respective variable is called totaldmgM.

To calculate the variable I first recode factor variables PROPDMGEXP and CROPDMGEXP to numeric types. Some of the levels are neglible and are recoded to 1:

# transform levels

table(selectedData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     39      1      6 424665      7  11331
table(selectedData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
# levels other than "no value", B, K, M are neglible as their refer only
# to a very small subset of data. Therefore these levels are recoded to 1.
# "No value" will be recoded to 1 as well; K = 1000, M = 1000000, B = 1000000000


transformedData <- mutate(selectedData, id = as.numeric(row.names(selectedData)))

levels(transformedData$PROPDMGEXP) <- c("1", "1","1","1","1","1","1","1","1","1",
                                        "1","1","1","1000000000", "1", "1","1000",
                                        "1","1000000")

transformedData$PROPDMGEXP <- as.numeric(levels(transformedData$PROPDMGEXP))[transformedData$PROPDMGEXP]

levels(transformedData$CROPDMGEXP) <- c("1", "1", "1", "1", "1000000000", "1",
                                        "1", "1000", "1", "1000000")
transformedData$CROPDMGEXP <- as.numeric(levels(transformedData$CROPDMGEXP))[transformedData$CROPDMGEXP]

Then the variable totaldmgM is calculated as a sum of property and crops damage in millions of USD.

transformedData <- transformedData %>%
    mutate(propdmgM = (PROPDMG * PROPDMGEXP / 1000000),
           cropdmgM = (CROPDMG * CROPDMGEXP) / 1000000) %>%
    mutate(totaldmgM = propdmgM + cropdmgM)

The data stored in EVTYPE variable need a lot of cleaning. For instance, some values were entered in lower-case, some in upper-case etc. This resulted in creating two levels where it was actually one level. In order to get rid of such errors, I set the levels in EVTYPE to case unsensitive

levels(transformedData$EVTYPE) <- tolower(levels(transformedData$EVTYPE))

Then I decided to convert EVTYPE variable values to more general categories, as some of factor levels refer to the same or very similar kind of event.

categorize <- function(x){
    
    categories <- c("flood", "tide", "tornado", "fire", "lightning", "storm", 
                    "hurricane", "wind", "rain", "snow", "heat", "cold", "hail")
    
    x <- gsub(".*flood.*", "flood", x)
    x <- gsub(".*tide.*", "tide", x)
    x <- gsub(".*surge.*", "tide", x)
    x <- gsub(".*tornado.*", "tornado", x)
    x <- gsub(".*fire.*", "fire", x)
    x <- gsub(".*lightning.*", "lightning", x)
    x <- gsub(".*storm.*", "storm", x)
    x <- gsub(".*hurricane.*", "hurricane", x)
    x <- gsub(".*typhoon.*", "hurricane", x)
    x <- gsub(".*wind.*", "wind", x)
    x <- gsub(".*rain.*", "rain", x)
    x <- gsub(".*snow.*", "snow", x)
    x <- gsub(".*blizzard.*", "snow", x)
    x <- gsub(".*avalanc.*", "snow", x)
    x <- gsub("(.*heat.*) | (.*drought.*)", "heat", x)
    x <- gsub(".*heat.*", "heat", x)
    x <- gsub(".*drought.*", "heat", x)
    x <- gsub(".*cold.*", "cold", x)
    x <- gsub(".*cool.*", "cold", x)
    x <- gsub(".*freeze.*", "cold", x)
    x <- gsub(".*hail.*", "hail", x)
    

}

eventKind <- as.data.frame(categorize(transformedData$EVTYPE))

names(eventKind) <- c("eventKind")

transformedData <- bind_cols(transformedData, eventKind)

As the indicator of impact of weather events on population’s health I introduce “healthDamage” variable, which is a sum of fatalities and injuries:

transformedData <- transformedData %>%
    mutate(healthDamage = FATALITIES + INJURIES)

Calculation of results

When the data are more or less tidy, I calculate the results:

Calculation of total property damage per event type:

options(scipen=999)

eventDamage <- group_by(transformedData, eventKind) %>%
    summarise(totalDamage = sum(totaldmgM, na.rm = TRUE)) %>%
    arrange(desc(totalDamage))

Then I select top 10 most harmful event types:

topDamage <- top_n(eventDamage, 10)

topDamageLabels <- as.character(topDamage[["eventKind"]])

topDamage$eventKind <- factor(topDamage$eventKind, levels = topDamageLabels)

mostHarmfulEconomic <- topDamageLabels[1]

Calculation of the most harmful events by state:

stateDamage <- transformedData %>%
    group_by(STATE, eventKind, add = TRUE) %>%
    summarise(stateDamage = sum(totaldmgM, na.rm = TRUE)) %>%
    arrange(STATE, desc(stateDamage)) %>%
    top_n(1, stateDamage) %>%
    filter(stateDamage > 0)

In order to be able to draw the map later, we need to get names for states abbreviations and add them to data frame:

url2 <- "http://www.whypad.com/wp-content/uploads/us_states.zip"

statesAbr <- download.file(url2, "abrr.zip")

unzipped <- unzip("abrr.zip")

statesAbbreviations <- read.csv(unzipped,
                                header = FALSE,
                                stringsAsFactors = FALSE,
                                sep = ",")
names(statesAbbreviations) <- c("statename", "region", "STATE")

statesAbbreviations$region <- tolower(statesAbbreviations$region)

stateDamage <- left_join(stateDamage, statesAbbreviations, by="STATE")

stateDamage <- stateDamage %>%
    filter(!is.na(region))

Calculation population health impact per event type and selecting top 10 most harmful event types:

healthDamage <- group_by(transformedData, eventKind) %>%
    summarise(totalHealthDamage = sum(healthDamage, na.rm = TRUE)) %>%
    arrange(desc(totalHealthDamage))

topHealthDamage <- top_n(healthDamage, 10)

topHealthLabels <- as.character(topHealthDamage[["eventKind"]])

topHealthDamage$eventKind <- factor(topHealthDamage$eventKind, 
                                    levels = topHealthLabels)

mostHarmfulHealth <- topHealthLabels[1]

Results

Across the United States, the following types of events have the greatest economic consequences:

damagePlot <- ggplot(topDamage, aes(eventKind, totalDamage)) +
    geom_bar(stat="identity", aes(fill=eventKind)) +
    scale_fill_brewer(palette = "Spectral", name = "Event type", labels = topDamageLabels) +
    labs(title="Total property damage by weather event kind",
         x = "Weather event kind",
         y = "Property damage in milions of USD") +
    theme_classic()

print(damagePlot)

The event that has the most harmful economic consequences is hurricane.

However, the most harmful event type vary by state. The following maps presents which event type caused the highest damage by state:

map <- map_data("state")

stateDamageMap <- merge(stateDamage, map, by="region")

stateDamagePlot <- ggplot(stateDamageMap, 
                          aes(fill = eventKind)) +
    geom_map(aes(map_id=region), 
             map = map,
             colour="grey",
             linetype="blank") +
    expand_limits(x=map$long, y=map$lat) +
    scale_fill_brewer(palette = "Spectral", name = "Event type") +
    theme_minimal() +
    labs(title="Most harmful event type by state", x = "Longitude",
         y = "Lattitude")
    
print(stateDamagePlot)

Across the United States, which types of events are most harmful with respect to population health:

healthPlot <- ggplot(topHealthDamage, aes(eventKind, totalHealthDamage)) +
    geom_bar(stat="identity", aes(fill=eventKind)) +
    scale_fill_brewer(palette = "Spectral", name = "Event type", labels = topHealthLabels) +
    labs(title="Total health damage by weather event kind",
         x = "Weather event kind",
         y = "Number of fatalities and injuries") +
    theme_classic()

print(healthPlot)

The event that has the most harmful impact on population health is tornado