Coursera - Reproducible Research #2 - Impact of Weather Events

1. SYNOPSIS

This project is the last assignment of the Coursera course on Reproducible Research.
The goal is to answer general questions about severe weather events, their impact on population health and economic consequences.

We’ll explore the NOAA Storm Database and will subset only the parts relevant to the questions.

The analysis consists of data processing to make the subset as tidy as possible, followed by plots of the impact of severe weather events.

For each question, we show the 10 weather event types with the greatest impact on population health and on the economy.

2. DATA PROCESSING

2.1 Data

The data for this assignment come from the NCDC (National Climatic Data Center), which collects storm event data from various sources, with records from 1950 to November 2011. Documentation about the dataset:
National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The dataset comprises 902,297 observations across 37 variables indicating dates of events, location (states, cities, longitude, etc), type of event, and impact of events (fatalities & damages).

2.2 Downloading file and loading the data

#store download link and zip file name in variables
DownloadLink <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
ZipFile <- "repdata%2Fdata%2FStormData.csv.bz2"
#download file after checking it isn't already in reader's current working directory
if(!file.exists(ZipFile)) {
    download.file(DownloadLink, destfile = ZipFile)
}

#load data if dataset NCDC_data hasn't been loaded yet
if(!exists("NCDC_data")){
    NCDC_data <- read.csv(bzfile(ZipFile))
}

2.3 Exploring the data

The dataset is large, so we’ll keep only the data relevant to the questions.
First we need to explore the format of each variable of interest.

#view names of variables
names(NCDC_data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

#we further executed unique(NCDC_data$VAR_NAME) on each variable; these calls are omitted here to save space. Please see comments below.
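
The per-variable inspection mentioned in the comment above could look like the following minimal sketch. To stay self-contained it runs on a small invented data frame (`toy`) rather than on NCDC_data, and `count_distinct` is a hypothetical helper name, not a function used in the analysis.

```r
# Toy stand-in for NCDC_data (values invented for illustration only)
toy <- data.frame(
    EVTYPE     = c("TORNADO", "tornado", "HAIL", "FLOOD", "HAIL"),
    FATALITIES = c(0, 1, 0, 2, 0),
    PROPDMGEXP = c("K", "M", "", "k", "K"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in each column of interest
count_distinct <- function(df, vars) {
    sapply(df[vars], function(col) length(unique(col)))
}

distinct_counts <- count_distinct(toy, c("EVTYPE", "FATALITIES", "PROPDMGEXP"))
print(distinct_counts)

# Range of a numeric impact variable
fatalities_range <- range(toy$FATALITIES)
print(fatalities_range)
```

On the real data, the same calls would be made with NCDC_data in place of `toy`.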

We’ll retain only the following variables:
- EVTYPE: type of event, a factor with 985 levels (event types), whereas the documentation lists 48 principal event types. Many labels are duplicates caused by typos or inconsistent case, some are ad hoc labels such as “summary of July 16, …”, and some are very granular event types.
- FATALITIES: 52 distinct values, ranging from 0 to 583.
- INJURIES: 200 distinct values, ranging from 0 to 1700.
- PROPDMG: property damage in USD; the value must be scaled by the exponent in the next variable, given either as a digit (scientific notation) or as a letter.
- PROPDMGEXP: the exponent to apply to the property damage value in the PROPDMG variable. Factor with 19 levels (no indication, ?, 0, 1, … k (thousands), m (millions), …)
- CROPDMG: crop damage in USD; the exponent is given in the next variable.
- CROPDMGEXP: the exponent to apply to the crop damage value in the CROPDMG variable. Factor with 9 levels (no indication, ?, 0, 1, … k (thousands), m (millions), …)
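
To illustrate how case and whitespace variants inflate the number of EVTYPE levels, here is a small sketch on invented labels. This normalisation is not applied in the analysis below; it only shows the effect under the assumption that upper-casing and trimming are acceptable clean-up steps.

```r
# Toy event labels showing the kinds of duplicates described above
# (invented examples, not taken from the dataset)
raw_labels <- c("TORNADO", "Tornado", " TORNADO", "HAIL", "hail", "FLASH FLOOD")

# Upper-case and trim surrounding whitespace so that such variants collapse
clean_labels <- toupper(trimws(raw_labels))

n_before <- length(unique(raw_labels))
n_after  <- length(unique(clean_labels))
print(c(before = n_before, after = n_after))
```

Typos and overly granular labels would still need a manual mapping to the documented event types.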

2.4 Pre-processing data

Since the dataset is large, we’ll subset the main dataset into 2 separate ones:
- dmg_data: retains 5 variables, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP
- population_data: retains 3 variables, EVTYPE, FATALITIES & INJURIES

In both cases, we’ll keep only observations where at least one impact value is greater than 0.

dmg_data <- NCDC_data[NCDC_data$PROPDMG > 0 | NCDC_data$CROPDMG > 0,c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
population_data <- NCDC_data[NCDC_data$FATALITIES > 0 | NCDC_data$INJURIES > 0,c("EVTYPE", "FATALITIES", "INJURIES")]

Some damage estimates (PROPDMG and CROPDMG variables) are accompanied by an exponent, given either as a digit n (meaning 10^n) or as one of the letters h, k, m or b (PROPDMGEXP and CROPDMGEXP variables).
We need to express all property and crop damages in a single unit. First let’s look at the exponent values in the datasets, for PROPDMG and CROPDMG values greater than 0.

table(dmg_data$PROPDMGEXP[dmg_data$PROPDMG > 0])

## 
##             -      ?      +      0      1      2      3      4      5 
##     76      1      0      5    209      0      1      1      4     18 
##      6      7      8      B      h      H      K      m      M 
##      3      2      0     40      1      6 227481      7  11319

table(dmg_data$CROPDMGEXP[dmg_data$CROPDMG > 0])

## 
##           ?     0     2     B     k     K     m     M 
##     3     0    12     0     7    21 20137     1  1918

Add PROPDMG_PROPER & CROPDMG_PROPER columns, factoring in the exponential value:

#Create function "exp_to_power" to convert all exp. values from the tables above
#(the name avoids masking base R's exp())
#h, k, m & b (and uppercases) will convert to 2, 3, 6 & 9, i.e. 10^2, 10^3, ...
#digits will be converted from characters to numbers
#all other characters will be put as 0 (10^0 = 1)
exp_to_power <- function(x){
    #as.character() matters for factor input, where as.numeric() would return the level code
    x <- as.character(x)
    if(x == "h" | x == "H")
        return(2)
    else if(x == "k" | x == "K")
        return(3)
    else if(x == "m" | x == "M")
        return(6)
    else if(x == "b" | x == "B")
        return(9)
    else if(grepl("^[0-9]+$", x))
        return(as.numeric(x))
    else
        return(0)
}
#add PROPDMG_PROPER and CROPDMG_PROPER columns factoring in the exponential value
dmg_data$PROPDMG_PROPER <- dmg_data$PROPDMG * 10 ^ sapply(dmg_data$PROPDMGEXP, FUN = exp_to_power)
dmg_data$CROPDMG_PROPER <- dmg_data$CROPDMG * 10 ^ sapply(dmg_data$CROPDMGEXP, FUN = exp_to_power)
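
The element-wise conversion above could alternatively be written as a vectorised lookup, which avoids calling a function once per row. The sketch below is an assumption about how that might look, not the method used in this report; `exp_lookup` and `code_to_power` are hypothetical names and the input values are invented.

```r
# Lookup table from letter codes to powers of ten
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical vectorised converter: exponent codes -> powers of ten
code_to_power <- function(codes) {
    codes <- toupper(as.character(codes))
    power <- unname(exp_lookup[codes])       # NA for codes not in the table
    digits <- grepl("^[0-9]$", codes)        # single-digit exponents
    power[digits] <- as.numeric(codes[digits])
    power[is.na(power)] <- 0                 # "", "?", "+", "-" -> 10^0 = 1
    power
}

# Invented example values covering each kind of code
codes   <- c("K", "m", "B", "5", "", "?", "h")
amounts <- c(25, 2.5, 1, 3, 10, 7, 4)
powers  <- code_to_power(codes)
damage  <- amounts * 10^powers
print(data.frame(codes, amounts, powers, damage))
```

With the real data, `code_to_power(dmg_data$PROPDMGEXP)` would take the place of the `sapply()` call.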

We now have two datasets:
- dmg_data, with 2 new columns holding the actual (proper) damage values in USD, for property and crops
- population_data, showing fatalities and injuries for all event types

3. RESULTS

3.1 QUESTION 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
We will aggregate two tables, one for Injuries and one for Fatalities:

#prepare the dataset for visualisation
INJURIES_table <- aggregate(INJURIES ~ EVTYPE, data = population_data, FUN =  sum)
FATALITIES_table <- aggregate(FATALITIES ~ EVTYPE, data = population_data, FUN =  sum)
#then reorder both tables starting with the highest value, and retain the 10 highest values only
INJURIES_table <- INJURIES_table[order(-INJURIES_table$INJURIES),][1:10,]
FATALITIES_table <- FATALITIES_table[order(-FATALITIES_table$FATALITIES),][1:10,]

#plot Injuries
library(ggplot2)
injuries_plot <- ggplot(INJURIES_table, aes(x = reorder(EVTYPE, -INJURIES), y = INJURIES))
injuries_plot + geom_bar(stat = "identity", fill = "deepskyblue3") + labs(x = "Type of Event", y = "Total number of injuries", title = "Health Impact (Injuries) of Weather Events - US") + theme(axis.text.x = element_text(angle = 60, hjust = 1))

#plot Fatalities
fatalities_plot <- ggplot(FATALITIES_table, aes(x = reorder(EVTYPE, -FATALITIES), y = FATALITIES))
fatalities_plot + geom_bar(stat = "identity", fill = "darkgreen") + labs(x = "Type of Event", y = "Total number of fatalities", title = "Health Impact (Fatalities) of Weather Events - US") + theme(axis.text.x = element_text(angle = 60, hjust = 1))

In both plots, tornados cause by far the most casualties of any event type.

3.2 QUESTION 2

Across the United States, which types of events have the greatest economic consequences?
We start by aggregating the damage values for property and crops:

#creating specific datasets for both property and crop damages
dmg_prop <- aggregate(PROPDMG_PROPER ~ EVTYPE, data = dmg_data, FUN = sum)
dmg_crop <- aggregate(CROPDMG_PROPER ~ EVTYPE, data = dmg_data, FUN = sum)
#ordering the tables in descending order and retaining top 10 values
dmg_prop <- dmg_prop[order(-dmg_prop$PROPDMG_PROPER),][1:10,]
dmg_crop <- dmg_crop[order(-dmg_crop$CROPDMG_PROPER),][1:10,]
#plot property damages
dmg_prop_plot <- ggplot(dmg_prop, aes(x = reorder(EVTYPE, -PROPDMG_PROPER), y = PROPDMG_PROPER/1000000000)) + 
    geom_bar(stat = "identity", fill = "deepskyblue3") + labs(x = "Type of Event", y = "Property Damage (billion USD)", title = "Economic Impact (Property) \n of Weather Events - US") + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) 
# plot crop damages 
dmg_crop_plot <- ggplot(dmg_crop, aes(x = reorder(EVTYPE, -CROPDMG_PROPER), y = CROPDMG_PROPER/1000000000)) + 
    geom_bar(stat = "identity", fill = "darkgreen") + labs(x = "Type of Event", y = "Crop Damage (billion USD)", title = "Economic Impact (Crop) \n of Weather Events - US") + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
#loading gridExtra package to merge both insights on a single figure
library(gridExtra)
grid.arrange(dmg_prop_plot, dmg_crop_plot, ncol = 2, nrow = 1)
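
Both questions rely on the same pattern: sum a variable per event type, sort in descending order, and keep the largest values. As a sketch, that pattern could be factored into a helper; `top_events` is a hypothetical name and the data below are invented for illustration.

```r
# Hypothetical helper: total of `value` per event type, keeping the n largest
top_events <- function(df, value, n = 10) {
    totals <- aggregate(df[[value]], by = list(EVTYPE = df$EVTYPE), FUN = sum)
    names(totals)[2] <- value
    totals <- totals[order(-totals[[value]]), ]
    head(totals, n)
}

# Toy data (invented values) to show the call
toy <- data.frame(
    EVTYPE   = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
    INJURIES = c(10, 4, 5, 1, 2)
)
top2 <- top_events(toy, "INJURIES", n = 2)
print(top2)
```

The same helper would apply to FATALITIES, PROPDMG_PROPER and CROPDMG_PROPER by changing the `value` argument.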

4. CONCLUSION

Based on the data analysed, and without correcting the inconsistent event type labels, the weather events most harmful to population health in the US are tornados. High winds, floods and heat follow tornados as the next most harmful events.

Regarding the economic impact, about 99.9% of damages concern property; the remainder concerns crops. Flash floods and thunderstorms are by far the event types causing the greatest economic damage.
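
The split between property and crop damage quoted above is a ratio of column totals. A minimal sketch of that arithmetic follows, using invented numbers in a toy data frame with the same column names as dmg_data; the resulting percentage is therefore not the figure from the real data.

```r
# Toy totals standing in for dmg_data (invented numbers, not the real results)
toy_dmg <- data.frame(
    EVTYPE         = c("FLOOD", "DROUGHT", "HAIL"),
    PROPDMG_PROPER = c(900, 50, 50),
    CROPDMG_PROPER = c(100, 300, 100)
)

total_prop <- sum(toy_dmg$PROPDMG_PROPER)
total_crop <- sum(toy_dmg$CROPDMG_PROPER)

# Share of all damage attributable to property, in percent
prop_share_pct <- 100 * total_prop / (total_prop + total_crop)
print(round(prop_share_pct, 1))
```

Running the same three lines on dmg_data would give the share for the actual dataset.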

Droughts are the primary factor endangering crops.