Synopsis

This report is an assignment for the Coursera course Reproducible Research. The assignment text describes the task as follows:

“Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.”

Data: Storm Data (47MB)

Data Documentation: National Weather Service Storm Data Documentation; National Climatic Data Center Storm Events FAQ

We explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and answer two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The analysis found that:

  1. Tornadoes are the most harmful weather events in the U.S. with respect to population health.

  2. Floods have the greatest economic consequences in the U.S.

Data Processing

Preprocessing

We download the data file if it does not already exist locally, then read it into the variable data.

setwd('.')

if (!file.exists('StormData.csv.bz2')) {
   url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2' 
   download.file(url, destfile = 'StormData.csv.bz2', method = 'curl')
}

data <- read.csv('StormData.csv.bz2', stringsAsFactors = FALSE)
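
Everything below depends on a handful of columns, so a quick check right after loading can catch a truncated or wrong download early. The following is a minimal sketch, assuming the column names used later in this report; check_columns is a hypothetical helper, and the toy data frame only stands in for data.

```r
# Hypothetical helper: returns the required columns missing from a data frame
check_columns <- function(df, required) {
    setdiff(required, names(df))
}

# Columns used later in this analysis
required <- c("EVTYPE", "FATALITIES", "INJURIES",
              "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Toy stand-in for the real data, with CROPDMGEXP deliberately left out
toy <- data.frame(EVTYPE = "TORNADO", FATALITIES = 1, INJURIES = 2,
                  PROPDMG = 25, PROPDMGEXP = "K", CROPDMG = 0,
                  stringsAsFactors = FALSE)

missing.cols <- check_columns(toy, required)
print(missing.cols)
```

On the real data, check_columns(data, required) should return an empty character vector.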

We use ggplot2 for graphs, and dplyr and tidyr for convenient data manipulation.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

I assume that harm to population health is measured by fatalities (FATALITIES in the data set) and injuries (INJURIES in the data set).

casualities.by.type <- data %>% 
    # Grouping by event type
    group_by(EVTYPE) %>%                            
    # Calculating summaries for fatalities and injuries
    summarise(FATALITIES = sum(FATALITIES),         
              INJURIES = sum(INJURIES)) %>%
    # Sorting by sum of fatalities and injuries in descending order
    arrange(desc(FATALITIES + INJURIES)) %>%
    # Taking only first 10 records
    slice(1:10) %>%
    # Melting injuries and fatalities 
    gather(CType, Value, c(FATALITIES, INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
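
To make the pipeline above easier to follow, here is a small sketch of the same steps (sum per event type, sort by combined casualties, keep the top rows, reshape to long format) in base R. The toy records and the cut-off of 2 are made up for illustration and are not taken from the storm database.

```r
# Toy records; the values are made up for illustration only
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
    FATALITIES = c(2, 1, 3, 4, 0),
    INJURIES   = c(10, 0, 20, 1, 2),
    stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by combined casualties, largest first, and keep the top 2
totals <- totals[order(-(totals$FATALITIES + totals$INJURIES)), ]
top <- head(totals, 2)

# Long format: one row per (event type, casualty type) pair
long <- data.frame(
    EVTYPE = rep(as.character(top$EVTYPE), 2),
    CType  = rep(c("FATALITIES", "INJURIES"), each = nrow(top)),
    Value  = c(top$FATALITIES, top$INJURIES),
    stringsAsFactors = FALSE
)
print(long)
```

The long format is what lets the plot stack fatalities and injuries in a single bar per event type.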

Now we can build the graph.

# The plot object gets its own name so that it does not mask ggplot()
ggp.health <- ggplot(data = casualities.by.type,
                     aes(x = reorder(EVTYPE, -Value), 
                         y = Value,
                         fill = CType)) +
    geom_bar(stat = 'identity', col = 'black') +
    labs(title = 'Top 10 Events By Casualties', 
         x = 'Type of event',
         y = 'Counts',
         fill = 'Type') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(ggp.health)

The graph shows that tornadoes are the most dangerous events for population health in the U.S.

Across the United States, which types of events have the greatest economic consequences?

In this analysis I assume that economic consequences consist of property damage (PROPDMG in the database) and crop damage (CROPDMG in the database).

First we drop the columns that are not needed.

dt2 <- data[c("EVTYPE", "CROPDMG", "CROPDMGEXP", "PROPDMG", "PROPDMGEXP")]

The database doesn’t store economic damage as plain numbers, but as pairs of a value (PROPDMG, CROPDMG) and its exponent (PROPDMGEXP, CROPDMGEXP). We need to transform them into simple numeric values. Numeric exponents are used as they are, as powers of ten. For alphabetic exponents we use the following rules: ‘b’ and ‘B’ stand for billion, ‘m’ and ‘M’ for million, ‘k’ and ‘K’ for thousand, ‘h’ and ‘H’ for hundred (page 12 of Storm Data Documentation). Any other exponent code leaves the value unchanged. Later, when aggregating, we also express the values in millions for convenience.

pd <- dt2$PROPDMG
pde <- dt2$PROPDMGEXP
cd <- dt2$CROPDMG
cde <- dt2$CROPDMGEXP

# Property damage: numeric exponents are powers of ten; anything that is not
# a number becomes NA here and is treated as 10^0 = 1
pde.n <- as.numeric(pde)
## Warning: NAs introduced by coercion
pd <- pd * 10 ** replace(pde.n, is.na(pde.n), 0)
# Alphabetic exponents, accepted in either case
pd[pde %in% c("B", "b")] <- pd[pde %in% c("B", "b")] * 1e9
pd[pde %in% c("M", "m")] <- pd[pde %in% c("M", "m")] * 1e6
pd[pde %in% c("K", "k")] <- pd[pde %in% c("K", "k")] * 1e3
pd[pde %in% c("H", "h")] <- pd[pde %in% c("H", "h")] * 1e2
# Any other exponent code leaves the value unchanged

# Crop damage: same rules as for property damage
cde.n <- as.numeric(cde)
## Warning: NAs introduced by coercion
cd <- cd * 10 ** replace(cde.n, is.na(cde.n), 0)
cd[cde %in% c("B", "b")] <- cd[cde %in% c("B", "b")] * 1e9
cd[cde %in% c("M", "m")] <- cd[cde %in% c("M", "m")] * 1e6
cd[cde %in% c("K", "k")] <- cd[cde %in% c("K", "k")] * 1e3
cd[cde %in% c("H", "h")] <- cd[cde %in% c("H", "h")] * 1e2

dt2$PROPDMG <- pd
dt2$CROPDMG <- cd
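
The same exponent rules can also be collected in one lookup function, which makes them easy to check on a few hand-picked values. This is only a sketch under the rules stated above, not the code used for the results in this report; exp_to_multiplier is a hypothetical helper and the toy values are made up.

```r
# Hypothetical helper: convert an exponent code to a numeric multiplier.
# Assumed rules (same as in the text): single digits are powers of ten,
# letters are h/k/m/b in either case, and anything else (e.g. "" or "?") is 1.
exp_to_multiplier <- function(code) {
    letters.map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
    code <- toupper(code)
    mult <- rep(1, length(code))

    # Alphabetic codes: look the multiplier up by name
    is.letter <- code %in% names(letters.map)
    mult[is.letter] <- letters.map[code[is.letter]]

    # Single-digit codes: ten to that power
    is.digit <- grepl("^[0-9]$", code)
    mult[is.digit] <- 10 ^ as.numeric(code[is.digit])

    mult
}

# Toy values, made up for illustration
dmg  <- c(25, 2.5, 1, 3, 7)
code <- c("K", "m", "B", "2", "?")
dollars <- dmg * exp_to_multiplier(code)
print(dollars)
```

A lookup of this kind avoids the coercion warning, because only codes that are known to be digits are passed to as.numeric.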

Now we can aggregate data by event type.

dt2 <- dt2 %>% 
    # Dropping the columns with exponents
    select(-c(CROPDMGEXP, PROPDMGEXP)) %>%
    # Grouping by event type
    group_by(EVTYPE) %>%
    # Aggregating by property damage and crops damage
    # also shifting to millions
    summarise(PROPDMG = sum(PROPDMG) / 1e6,
              CROPDMG = sum(CROPDMG) / 1e6) %>%
    # Sorting by sum of property damage and crops damage in descending order
    arrange(desc(PROPDMG + CROPDMG)) %>% 
    # Taking first 10 records
    slice(1:10) %>%
    # Melting crops/property damage by type for plotting
    gather(TYPE, VALUE, CROPDMG:PROPDMG)
## `summarise()` ungrouping output (override with `.groups` argument)

Now we can build the graph.

ggp <- ggplot(dt2, 
              aes(x = reorder(EVTYPE, -VALUE), 
                  y = VALUE, fill = TYPE)) + 
    geom_bar(stat = "identity", col = 'black') +
    labs(x = "Type of event", y = "Value (in Millions)") +
    labs(title = "Top 10 Types of Events By Economic Consequences") +
    labs(fill = "Type") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))

print(ggp)

We can conclude from the graph that floods have the greatest economic consequences.

Results

  1. Tornadoes are the most harmful weather events in the U.S. with respect to population health.

  2. Floods have the greatest economic consequences in the U.S.