Reproducible Research Course Project 2

## Analysis of the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This project explores the NOAA storm database, which tracks major storms and weather events, to identify the types of weather events in the USA that caused the greatest harm to the population, in terms of fatalities and injuries, and the greatest economic loss during the years 1950-2011.

There are two goals of this analysis:

## - identify the weather events that are most harmful with respect to population health
## - identify the weather events that have the greatest economic consequences.

Based on our analysis

## We conclude that TORNADOS and FLOODS are the most harmful weather events in the USA in terms of risk to human health and economic impact.

Data Processing

#The data source is a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The source file can be downloaded from the course web site: Storm Data

# downloading data needed ----------------------------------------------------------------
library(dplyr)
library(ggplot2)
library(data.table)
library(lubridate)
library(rmarkdown)

Url_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename <- "repdata_data_StormData.csv.bz2"
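# download only if the file is not already present; mode = "wb" keeps the binary archive intact on Windows
if (!file.exists(filename)) download.file(Url_data, filename, mode = "wb")
# reading data (fread needs the R.utils package to read .bz2 files; it already returns a data.table)
My_data <- fread(file = filename, sep = "auto", header = TRUE)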

Additional documentation on the database was provided here:

1. According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and only from January 1996 were all event types recorded. Since our objective is to compare the effects of different weather events, we include only events that began in January 1996 or later.

 # Parse the date strings and keep events that began on or after 1 Jan 1996
  My_data$BGN_DATE <- mdy_hms(My_data$BGN_DATE)
  My_data <- My_data[BGN_DATE >= as.POSIXct("1996-01-01", tz = "UTC")]
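
The statement that event coverage broadened over time can be checked by counting the distinct event types recorded in each year. The following is a minimal sketch on a small invented table (the `toy` rows are made up for illustration; with the real data, `My_data` after the date conversion would take its place):

```r
library(data.table)

# Hypothetical toy rows standing in for the storm data (not real records)
toy <- data.table(
  BGN_DATE = as.POSIXct(c("1950-04-18", "1975-06-01", "1996-01-05",
                          "1996-03-10", "2005-08-29"), tz = "UTC"),
  EVTYPE   = c("TORNADO", "TORNADO", "FLOOD", "HAIL", "HURRICANE/TYPHOON")
)

# Number of distinct event types recorded in each calendar year
types_per_year <- toy[, .(n_types = uniqueN(EVTYPE)),
                      by = .(year = year(BGN_DATE))]
print(types_per_year)
```

A sharp rise in `n_types` around 1996 in the real data would support restricting the comparison to 1996 and later.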

2. Based on the above-mentioned documentation and a preliminary exploration of the raw data with ‘str’, ‘names’, ‘table’, ‘dim’, ‘head’, ‘range’ and similar functions, we conclude that seven variables are relevant to the two questions.

Namely: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.
Therefore, we can limit our data to these variables.

# Select the needed columns
  My_data <- My_data[, colnames(My_data) %in% 
                       c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
                     , with=FALSE]

Contents of data now are as follows:

EVTYPE = type of event
FATALITIES = number of fatalities
INJURIES = number of injuries
PROPDMG = the size of property damage
PROPDMGEXP = the exponent values for ‘PROPDMG’ (property damage)
CROPDMG = the size of crop damage
CROPDMGEXP = the exponent values for ‘CROPDMG’ (crop damage)

3. There are almost 1000 unique event types in the EVTYPE column, so it is better to reduce the data to a manageable size. We do this by converting EVTYPE to upper case and by keeping only the rows in which all four target quantities (FATALITIES, INJURIES, PROPDMG and CROPDMG) are non-zero.

#cleaning event types names
  My_data$EVTYPE <- toupper(My_data$EVTYPE)
  
  # keeping only rows where all four measures are non-zero (conditions joined with &)
  My_data <- My_data[FATALITIES != 0 &
                       INJURIES != 0 & 
                       PROPDMG != 0 & 
                       CROPDMG != 0]
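
The filter above joins its conditions with `&`, so a row is kept only when every one of the four measures is non-zero. The sketch below, on invented numbers, contrasts this with joining the conditions by `|`, which would keep a row when at least one measure is non-zero. The `toy` data frame and both result names are illustrative assumptions, not part of the analysis:

```r
# Hypothetical toy rows (invented numbers) to show how the filter behaves
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "HAIL", "HEAT", "FOG"),
  FATALITIES = c(1, 2, 0, 3, 0),
  INJURIES   = c(4, 0, 0, 5, 0),
  PROPDMG    = c(10, 5, 2, 0, 0),
  CROPDMG    = c(3, 0, 0, 0, 0)
)

# Conditions joined with `&`: a row survives only if ALL four values are non-zero
all_nonzero <- subset(toy, FATALITIES != 0 & INJURIES != 0 &
                           PROPDMG != 0 & CROPDMG != 0)

# Conditions joined with `|`: a row survives if ANY of the four values is non-zero
any_nonzero <- subset(toy, FATALITIES != 0 | INJURIES != 0 |
                           PROPDMG != 0 | CROPDMG != 0)

nrow(all_nonzero)  # 1 (only FLOOD)
nrow(any_nonzero)  # 4 (everything except FOG)
```

The `&` form is much stricter, so the totals reported later cover only events that caused deaths, injuries, property damage and crop damage all at once.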

Population health data processing

We aggregate the numbers of fatalities and injuries to identify the top 10 events by total casualties (PEOPLE_LOSS):

  #pivot table with dplyr
  My_data <- data.frame(My_data) #transfer back to data.frame
  Health_data <- My_data %>% group_by(EVTYPE) %>%
                             summarise(FATALITIES = sum(FATALITIES),
                                       INJURIES = sum(INJURIES))


  Health_data <- data.table(Health_data)  #transfer back to data.table
  Health_data <- Health_data[, PEOPLE_LOSS := FATALITIES + INJURIES, by = "EVTYPE"]
  # descending order by PEOPLE_LOSS                           
  Health_data <- Health_data[order(Health_data$PEOPLE_LOSS, decreasing = TRUE), ]
  #top 10 by PEOPLE_LOSS (weight column named explicitly)
  Top10.EVTYPE.People <- top_n(Health_data, 10, wt = PEOPLE_LOSS)

knitr::kable(Top10.EVTYPE.People, format = "markdown")

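| EVTYPE            | FATALITIES | INJURIES | PEOPLE_LOSS |
|:------------------|-----------:|---------:|------------:|
| FLOOD             |         37 |     2487 |        2524 |
| TORNADO           |        160 |     1332 |        1492 |
| HURRICANE/TYPHOON |         22 |      884 |         906 |
| TROPICAL STORM    |          7 |      266 |         273 |
| FLASH FLOOD       |         18 |      214 |         232 |
| TSUNAMI           |         32 |      129 |         161 |
| WILDFIRE          |         31 |      124 |         155 |
| EXCESSIVE HEAT    |         46 |       18 |          64 |
| HIGH WIND         |         10 |       51 |          61 |
| HEAVY SNOW        |          4 |       38 |          42 |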
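
The aggregation and ranking can also be done in data.table alone, which avoids converting between data.frame and data.table. The following is a sketch on invented numbers; `top_n_events` is a hypothetical helper name, and unlike `top_n` it does not keep extra rows when values are tied at the cut-off:

```r
library(data.table)

# Hypothetical toy rows (invented numbers), several rows per event type
toy <- data.table(
  EVTYPE     = c("FLOOD", "FLOOD", "TORNADO", "HAIL", "TORNADO"),
  FATALITIES = c(1, 2, 5, 0, 1),
  INJURIES   = c(10, 20, 30, 1, 4)
)

# Sum per event type, add the combined total, sort descending, keep the top n
top_n_events <- function(dt, n = 10) {
  agg <- dt[, .(FATALITIES = sum(FATALITIES),
                INJURIES   = sum(INJURIES)), by = EVTYPE]
  agg[, PEOPLE_LOSS := FATALITIES + INJURIES]
  head(agg[order(-PEOPLE_LOSS)], n)
}

top2 <- top_n_events(toy, n = 2)
print(top2)
```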

Economic consequences data processing

The value in the exponent columns (PROPDMGEXP and CROPDMGEXP) encodes a power of ten. The total damage is therefore the damage figure multiplied by ten raised to that power: PROPDMG × 10^PROPDMGEXP for property and CROPDMG × 10^CROPDMGEXP for crops.

Exponent values are:

digits (0 to 8)
letters (B or b = Billion, M or m = Million, K or k = Thousand, H or h = Hundred)
and the symbols “-”, “+” and “?”, which refer to less than, greater than and low certainty. These three symbols could be ignored altogether; here “+” is treated as exponent 1 and “-” and “?” as exponent 0.

We transform letters and symbols to numbers:

 #transform letters and symbols to numbers (same rules for both exponent columns)
  exp_to_number <- function(x) {
    x <- gsub("[Hh]", "2", x)
    x <- gsub("[Kk]", "3", x)
    x <- gsub("[Mm]", "6", x)
    x <- gsub("[Bb]", "9", x)
    x <- gsub("\\+", "1", x)
    x <- gsub("\\?|\\-|\\ ", "0", x)
    as.numeric(x)
  }
  
  My_data$PROPDMGEXP <- exp_to_number(My_data$PROPDMGEXP)
  My_data$CROPDMGEXP <- exp_to_number(My_data$CROPDMGEXP)

Finally, we create new variables for total property damage and total crop damage (this requires the ‘dplyr’ package).

#creating total damage values
  # a missing exponent is treated as 0, i.e. a multiplier of 1
  My_data$PROPDMGEXP[is.na(My_data$PROPDMGEXP)] <- 0
  My_data$CROPDMGEXP[is.na(My_data$CROPDMGEXP)] <- 0
  
  #Total damage values
  My_data <- mutate(My_data, 
                      PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP), 
                      CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
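
As a worked example of the arithmetic, a damage figure of 25 with exponent code "K" stands for 25 × 10^3 = 25,000, and 1.5 with code "m" stands for 1.5 × 10^6 = 1,500,000. The sketch below does the same decoding with a lookup table instead of a chain of `gsub()` calls. `exp_lookup` and `decode_exponent` are hypothetical names, and the mapping of "+", "-", "?" and blanks is an assumption chosen to mirror the `gsub()` rules above:

```r
# Lookup table from exponent code to power of ten.
# Assumption: "+" -> 1 and "-", "?" -> 0, mirroring the gsub() rules above.
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9, "+" = 1, "-" = 0, "?" = 0)

# decode_exponent() is a hypothetical helper, not part of any package
decode_exponent <- function(code) {
  code <- toupper(trimws(code))
  out <- unname(exp_lookup[code])              # letters and symbols
  is_digit <- grepl("^[0-9]+$", code)
  out[is_digit] <- as.numeric(code[is_digit])  # digits are used as they are
  out[is.na(out)] <- 0                         # blank or unknown codes
  out
}

# Invented damage figures and exponent codes
damage <- c(25, 1.5, 2, 3)
code   <- c("K", "m", "", "2")
total  <- damage * 10 ^ decode_exponent(code)
print(total)
```

A lookup table keeps all the rules in one place, which makes it easier to change how the ambiguous symbols are treated.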

Now we aggregate property and crop damage to identify the top 10 events by total economic loss (ECONOMIC_LOSS):

 #Economic_data: let us now analyze the data from above
  #pivot table with dplyr
  Economic_data <- My_data %>% group_by(EVTYPE) %>%
                               summarise(PROPDMGTOTAL = sum(PROPDMGTOTAL),
                                         CROPDMGTOTAL = sum(CROPDMGTOTAL))


  Economic_data <- data.table(Economic_data)  #transfer back to data.table
  Economic_data <- Economic_data[, ECONOMIC_LOSS := PROPDMGTOTAL + CROPDMGTOTAL, by = "EVTYPE"]
  # descending order by ECONOMIC_LOSS    
  Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
  #top 10 by ECONOMIC_LOSS (weight column named explicitly)
  Top10.EVTYPE.economy <- top_n(Economic_data, 10, wt = ECONOMIC_LOSS)

knitr::kable(Top10.EVTYPE.economy, format = "markdown")

| EVTYPE            | PROPDMGTOTAL | CROPDMGTOTAL | ECONOMIC_LOSS |
|:------------------|-------------:|-------------:|--------------:|
| HURRICANE/TYPHOON |  11300000000 |   1795000000 |   13095000000 |
| WILDFIRE          |   1165120000 |     75150000 |    1240270000 |
| HIGH WIND         |    948190000 |    222930000 |    1171120000 |
| TORNADO           |   1040902000 |     42920000 |    1083822000 |
| TROPICAL STORM    |    628470000 |    121690000 |     750160000 |
| EXCESSIVE HEAT    |       170000 |    492400000 |     492570000 |
| HURRICANE         |    140250000 |    127000000 |     267250000 |
| FLOOD             |    210500000 |     12180500 |     222680500 |
| FLASH FLOOD       |     94657000 |      2235000 |      96892000 |
| TSUNAMI           |     81000000 |        20000 |      81020000 |

Results

The graph of population health impact shows that FLOOD, TORNADO and HURRICANE/TYPHOON are the main contributors to deaths and injuries among the event types retained in the analysis.

 #plotting health loss -> HL
 
  HL <- ggplot(data = Top10.EVTYPE.People, aes(x = reorder(EVTYPE, PEOPLE_LOSS), y = PEOPLE_LOSS)) +
        geom_bar(stat = "identity", colour = "black") +
        labs(title = "USA total people loss by weather events in 1996-2011") +
        theme(plot.title = element_text(hjust = 0.5)) +
        labs(y = "Number of fatalities and injuries", x = "Event Type") +
        coord_flip()
 
  HL

The graph of economic impact shows that HURRICANE/TYPHOON is by far the largest contributor to economic loss among the event types retained in the analysis, followed by WILDFIRE, HIGH WIND and TORNADO.

#plotting economic loss -> EL
  
  EL <- ggplot(data = Top10.EVTYPE.economy, aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS)) +
        geom_bar(stat = "identity", colour = "black") +
        labs(title = "USA total economic loss by weather events in 1996-2011") +
        theme(plot.title = element_text(hjust = 0.5)) +
        labs(y = "Size of property and crop loss", x = "Event Type") +
        coord_flip()
 
  EL