Synopsis

This document analyses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The goal is to identify the event types that are most harmful to population health and to the economy. The dataset records many different event types, but in both categories only a handful of them cause most of the harm and destruction. Tornadoes are the undoubted “leader” in the “population health” category: almost 70% of the deaths and injuries among the top event types are attributed to them. Economic damage is spread more evenly: floods, hurricanes and tornadoes (again) together account for roughly 60% of the damage among the top event types.

Data Processing

Dataset is available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

FAQ: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

Functions

First we define a few helper functions that are used later:

##functions
loadpackage <- function (name) 
{
        if (!require(name, character.only = T)){
                install.packages(name)
                library(package = name, character.only = T)
        }   
}

drawpie <- function(slices, main){
        labels <- character()
        # slices <- groupedDF[groupedDF$column>0]$column
        pie(slices, labels, main = paste(main, "(Total:",length(slices),")"))    
}

order_by_totaldmg_per <- function (df){
        # sort rows by share of total damage, largest first (the comma selects rows)
        df <- df[order(df$PercentTotalDamage, decreasing = TRUE), ]
        return(df)
}

set_totaldmg <- function(df, col1, col2){
        df$TotalDamage <- col1+col2
        df$PercentTotalDamage <- round(df$TotalDamage*100/sum(df$TotalDamage),2)
        return(df)
}


set_exp <- function(col_exp){
        # translate magnitude codes (case-insensitive) into numeric multipliers
        col_exp <- gsub("K","1000",col_exp, ignore.case = TRUE)
        col_exp <- gsub("M","1e6",col_exp, ignore.case = TRUE)
        col_exp <- gsub("B","1e9",col_exp, ignore.case = TRUE)
        col_exp <- as.numeric(col_exp)
        return(col_exp)
}
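To make the effect of the helpers concrete, here is a minimal sketch on made-up numbers (the `toy` data frame and its values are invented for illustration). It repeats the logic of set_totaldmg so that it runs on its own, then sorts by the computed share.

```r
# Toy data: made-up numbers, only to show what set_totaldmg() computes.
toy <- data.frame(EVTYPE = c("A", "B", "C"),
                  FATALITIES = c(10, 30, 0),
                  INJURIES = c(40, 10, 10))

# Same logic as set_totaldmg() above, repeated so the snippet is self-contained.
set_totaldmg <- function(df, col1, col2) {
  df$TotalDamage <- col1 + col2
  df$PercentTotalDamage <- round(df$TotalDamage * 100 / sum(df$TotalDamage), 2)
  df
}

toy <- set_totaldmg(toy, toy$FATALITIES, toy$INJURIES)

# Largest share first, as order_by_totaldmg_per() does.
toy <- toy[order(toy$PercentTotalDamage, decreasing = TRUE), ]
print(toy)
```

Note that the percentages are relative to the rows passed in, not to the whole dataset.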

Packages

Then we load the necessary packages, using the “loadpackage” function:

#loading packages
loadpackage("data.table")
loadpackage("R.utils")
loadpackage("dplyr")
loadpackage("ggplot2")
loadpackage("knitr")

Reading data

Now we can download, decompress and read the data. We use the “fread” function because it is faster than read.csv. Don’t forget to set your working directory!

#load data
filename <- "stormdata.csv.bz2"
filenameCSV <- "stormdata.csv"

if (!file.exists(filename)){
        # mode = "wb" keeps the compressed file intact on Windows
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                      filename, mode = "wb")
        bunzip2(filename, filenameCSV, remove = FALSE, skip = FALSE)    
}

DataStorm <- fread(filenameCSV, sep=",", header = T)
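Since only a few columns are used in the analysis, it may be worth reading just those. The sketch below shows the idea with the `select` argument of fread on a tiny stand-in file; the rows and the OTHER column are made up so that the snippet runs without the real dataset.

```r
library(data.table)

# Write a tiny stand-in CSV (made-up rows; OTHER is a placeholder column).
tmp <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,OTHER",
             "TORNADO,0,15,25,K,0,,some text",
             "HAIL,0,0,0,,5,M,more text"),
           tmp)

# Columns used later in this analysis.
needed <- c("EVTYPE", "FATALITIES", "INJURIES",
            "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# select= tells fread to keep only the named columns.
small <- fread(tmp, select = needed)
print(names(small))
```

With the real file, the same `select = needed` argument could be passed to the fread call above, assuming these column names match the file header.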

Data cleaning

According to the database documentation, we are interested in four variables:

  • Population health:

    FATALITIES
    INJURIES
  • Economy:

    PROPDMG (property damage)
    CROPDMG (crop damage)

The economic variables are stored together with magnitude codes (the PROPDMGEXP and CROPDMGEXP columns):

  • B - billions
  • M - millions
  • K - thousands

It seems better to drop rows that have any other symbol in the PROPDMGEXP or CROPDMGEXP columns.

DataStorm <- subset(DataStorm, DataStorm$CROPDMGEXP %in% c("","b","B","k","K","m","M") &
                            DataStorm$PROPDMGEXP %in% c("","b","B","k","K","m","M"))
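Before filtering, it can help to see how many rows each code covers. The following sketch tabulates a made-up vector of codes (the stray symbols "?", "5" and "+" are invented examples) and counts how many entries the filter above would keep or drop.

```r
# Made-up exponent codes, including a few hypothetical stray symbols.
exp_codes <- c("K", "k", "M", "", "B", "?", "5", "K", "+", "m")

# The codes accepted by the filter above.
keep <- c("", "b", "B", "k", "K", "m", "M")

# Frequency of each code.
print(table(exp_codes))

# How many entries survive the filter, and how many are dropped.
kept <- exp_codes %in% keep
print(sum(kept))
print(sum(!kept))
```

On the real data, `table(DataStorm$PROPDMGEXP)` would give the same kind of overview.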

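Now we can replace the codes with numeric multipliers and apply them to the damage values. An empty code is set to 0 here, so damage recorded without a code counts as zero.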

# an empty code becomes "0", i.e. a multiplier of zero
DataStorm$PROPDMGEXP[DataStorm$PROPDMGEXP==""] <- "0"
DataStorm$PROPDMGEXP <- set_exp(DataStorm$PROPDMGEXP)
DataStorm$PROPDMG <- DataStorm$PROPDMGEXP*DataStorm$PROPDMG

DataStorm$CROPDMGEXP[DataStorm$CROPDMGEXP==""] <- "0"
DataStorm$CROPDMGEXP <- set_exp(DataStorm$CROPDMGEXP)
DataStorm$CROPDMG <- DataStorm$CROPDMGEXP*DataStorm$CROPDMG
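The conversion can also be pictured as a lookup from code to multiplier. The sketch below is an alternative illustration, not the method used above: it uses a named vector instead of gsub, and the damage values are made up.

```r
# Multipliers for the three documented codes; "0" mirrors the value
# that the code above assigns to empty exponents.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9, "0" = 0)

# Made-up damage values and their codes.
dmg  <- c(25, 2.5, 1.2, 10)
code <- c("K", "m", "B", "0")

# toupper() makes the lookup case-insensitive, like ignore.case in set_exp().
dollars <- dmg * unname(multiplier[toupper(code)])
print(dollars)
```

For example, 25 with code "K" becomes 25000 dollars, while a value whose code was empty contributes nothing.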

All economic variables are now expressed in the same unit (dollars), so we can work with them directly.

Data summarising and first plots

First we aggregate the dataset by event type.

groupedDF <- DataStorm %>% group_by(EVTYPE)  %>%
        summarise(FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES), 
                  PROPDMG= sum(PROPDMG),CROPDMG= sum(CROPDMG))
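For readers less familiar with dplyr, the same kind of per-group sum can be sketched in base R with aggregate. The data frame below is made up and only illustrates the shape of the result: one row per event type.

```r
# Made-up event records.
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  FATALITIES = c(1, 3, 0, 2, 0),
                  INJURIES = c(4, 20, 1, 10, 0))

# Sum both columns within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```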

How many different event types are in the dataset?

events_count <- nrow(groupedDF)

981. That is a lot, but we do not need to explore every single type. We are interested only in the “top” cases.

So we should check how the damage is distributed across event types. Pie charts are handy for this (in brackets: the number of event types with a nonzero value for that variable):

#pie diagrams (one slice per event type with a nonzero total)
par(mfrow=c(2,2), mar= c(1,1,1,1))
drawpie(groupedDF$FATALITIES[groupedDF$FATALITIES>0], "Fatalities")
drawpie(groupedDF$INJURIES[groupedDF$INJURIES>0], "Injuries")
drawpie(groupedDF$PROPDMG[groupedDF$PROPDMG>0], "Property damage")
drawpie(groupedDF$CROPDMG[groupedDF$CROPDMG>0], "Crop damage")

Of course, these charts are too small to draw exact conclusions. But we can see that in almost all cases there is one “dominating” event type, whose contribution is almost always at the 40-50% level.

So we can analyse only the “top” events in each case.

Health plot

We create a new data frame that contains only the “health” variables (and only the “top” event types).

#health
DFHealth <- subset(groupedDF, groupedDF$FATALITIES>quantile(groupedDF$FATALITIES,0.99, na.rm = T)|
                           groupedDF$INJURIES>quantile(groupedDF$INJURIES,0.99,na.rm = T),
                   select=c(EVTYPE, FATALITIES, INJURIES))
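The 99th-percentile filter keeps only values strictly above the cutoff returned by quantile. A minimal sketch on 100 made-up totals shows how few values pass; with 981 event types, the same rule would keep roughly the ten largest values per variable.

```r
# 100 made-up event totals: mostly zeros, a few small values, one large value.
fatalities <- c(rep(0, 90), 1:9, 500)

# Cutoff at the 99th percentile (default interpolation of quantile()).
cutoff <- quantile(fatalities, 0.99)
print(cutoff)

# Only values strictly above the cutoff are kept.
top <- fatalities[fatalities > cutoff]
print(top)
```

Because the filter above joins two such conditions with `|`, an event type is kept if it is in the top group for either variable.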

Then we add new variables for the total harm (fatalities plus injuries) and for each event type’s share of that total.

DFHealth <- set_totaldmg(DFHealth, DFHealth$FATALITIES, DFHealth$INJURIES)

Let’s make a plot:

ggplot(DFHealth, aes(x=FATALITIES,y= INJURIES))+
        geom_point()+ 
        geom_text(aes(label=ifelse(FATALITIES>1000,as.character(EVTYPE),'')), hjust=0)+
        theme_bw()+
        xlim(c(0,7000))+
        xlab("Fatalities")+
        ylab("Injuries")+
        ggtitle("Population health consequences of different events")

We can see that tornadoes lead by a wide margin. No doubt, tornadoes are the most harmful event type for population health.

Economy plot

Let’s do the same operations with “economy” variables:

#Economy
# note: both conditions below use the 99th percentile of CROPDMG as the threshold
DFEconomy <- subset(groupedDF, groupedDF$PROPDMG>quantile(groupedDF$CROPDMG,0.99,na.rm = T)|
                            groupedDF$CROPDMG>quantile(groupedDF$CROPDMG,0.99,na.rm = T),
                    select = c(EVTYPE,PROPDMG,CROPDMG))

#convert values to billions of dollars (the raw values are too large to read)
DFEconomy$PROPDMG <- round(DFEconomy$PROPDMG/1e9,2)
DFEconomy$CROPDMG <- round(DFEconomy$CROPDMG/1e9,2)
DFEconomy <- set_totaldmg(DFEconomy, DFEconomy$PROPDMG, DFEconomy$CROPDMG)

ggplot(DFEconomy, aes(x=PROPDMG,y= CROPDMG))+
        geom_point()+ 
        geom_text(aes(label=ifelse(PROPDMG>50|CROPDMG>6,as.character(EVTYPE),'')), 
                  hjust=0)+
        theme_bw()+
        xlab("Property damage, billions of $")+
        ylab("Crop damage, billions of $")+
        xlim(c(0,160))+
        ggtitle("Economic consequences of different events")

Drought is more dangerous for crops than anything else. For property damage, however, flood is the most destructive event type (note that the absolute values are much greater than for crop damage). Hurricanes and tornadoes are very dangerous too.

Results

We have identified the most harmful event types. The following tables list the top causes of harm, ordered by their contribution to the total. For population health:

#Health table

DFHealth <- order_by_totaldmg_per(DFHealth)

kable(DFHealth[DFHealth$PercentTotalDamage>1, ], format = "markdown", 
      col.names = c("Event","Fatalities",
                    "Injuries","Total damage", "Total damage, %"))
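
| Event             | Fatalities | Injuries | Total damage | Total damage, % |
|:------------------|-----------:|---------:|-------------:|----------------:|
| TORNADO           |       5630 |    91321 |        96951 |           69.56 |
| EXCESSIVE HEAT    |       1903 |     6525 |         8428 |            6.05 |
| TSTM WIND         |        504 |     6957 |         7461 |            5.35 |
| FLOOD             |        470 |     6789 |         7259 |            5.21 |
| LIGHTNING         |        816 |     5230 |         6046 |            4.34 |
| HEAT              |        937 |     2100 |         3037 |            2.18 |
| FLASH FLOOD       |        978 |     1777 |         2755 |            1.98 |
| ICE STORM         |         89 |     1975 |         2064 |            1.48 |
| THUNDERSTORM WIND |        133 |     1488 |         1621 |            1.16 |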
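As a quick consistency check, the sketch below re-enters the rows of the health table by hand and confirms that each total equals fatalities plus injuries. It also sums the percentage column to show how much of the selected events' harm the listed rows cover. The data frame name `health` is introduced here only for illustration.

```r
# Rows copied by hand from the health table above.
health <- data.frame(
  Event = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
            "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND"),
  Fatalities = c(5630, 1903, 504, 470, 816, 937, 978, 89, 133),
  Injuries = c(91321, 6525, 6957, 6789, 5230, 2100, 1777, 1975, 1488),
  Total = c(96951, 8428, 7461, 7259, 6046, 3037, 2755, 2064, 1621),
  Percent = c(69.56, 6.05, 5.35, 5.21, 4.34, 2.18, 1.98, 1.48, 1.16)
)

# Each total should be the sum of the two health columns.
totals_match <- all(health$Fatalities + health$Injuries == health$Total)
print(totals_match)

# Share of the selected events' harm covered by the listed rows.
covered <- sum(health$Percent)
print(covered)
```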

And for economy:

#Economy table

DFEconomy <- order_by_totaldmg_per(DFEconomy)

kable(DFEconomy[DFEconomy$PercentTotalDamage>1, ], format = "markdown", 
      col.names = c("Event","Property damage $B", "Crop damage $B",
                    "Total damage $B", "Total damage, %"))
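
| Event             | Property damage $B | Crop damage $B | Total damage $B | Total damage, % |
|:------------------|-------------------:|---------------:|----------------:|----------------:|
| FLOOD             |             144.66 |           5.66 |          150.32 |           32.29 |
| HURRICANE/TYPHOON |              69.31 |           2.61 |           71.92 |           15.45 |
| TORNADO           |              56.94 |           0.36 |           57.30 |           12.31 |
| STORM SURGE       |              43.32 |           0.00 |           43.32 |            9.31 |
| HAIL              |              15.73 |           3.00 |           18.73 |            4.02 |
| FLASH FLOOD       |              16.14 |           1.42 |           17.56 |            3.77 |
| DROUGHT           |               1.05 |          13.97 |           15.02 |            3.23 |
| HURRICANE         |              11.87 |           2.74 |           14.61 |            3.14 |
| RIVER FLOOD       |               5.12 |           5.03 |           10.15 |            2.18 |
| ICE STORM         |               3.94 |           5.02 |            8.96 |            1.92 |
| TROPICAL STORM    |               7.70 |           0.68 |            8.38 |            1.80 |
| WINTER STORM      |               6.69 |           0.03 |            6.72 |            1.44 |
| HIGH WIND         |               5.27 |           0.64 |            5.91 |            1.27 |
| WILDFIRE          |               4.77 |           0.30 |            5.07 |            1.09 |
| TSTM WIND         |               4.48 |           0.55 |            5.03 |            1.08 |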
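To read the combined weight of the leading rows off the economy table, the sketch below re-enters its first three rows by hand and adds up their shares. The name `economy_top` is introduced here only for illustration.

```r
# First three rows of the economy table above, copied by hand.
economy_top <- data.frame(
  Event = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO"),
  Property = c(144.66, 69.31, 56.94),
  Crop = c(5.66, 2.61, 0.36),
  Total = c(150.32, 71.92, 57.30),
  Percent = c(32.29, 15.45, 12.31)
)

# Each total should be property plus crop damage (rounded to cents of a billion).
totals_match <- all(round(economy_top$Property + economy_top$Crop, 2) ==
                      round(economy_top$Total, 2))
print(totals_match)

# Combined share of the three leading event types.
top3_share <- sum(economy_top$Percent)
print(top3_share)
```

The shares are relative to the event types that passed the percentile filter, not to all recorded damage.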