Tornadoes and floods are the most harmful events

Data Processing

Dataset is available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

FAQ: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

Functions

First we define a few helper functions that are used later:

##functions
loadpackage <- function (name) 
{
        if (!require(name, character.only = T)){
                install.packages(name)
                library(package = name, character.only = T)
        }   
}

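drawpie <- function(slices, main){
        # empty label vector: the slices are left unlabelled on purpose
        labels <- character()
        # the title shows how many event types contribute to the chart
        pie(slices, labels, main = paste(main, "(Total:",length(slices),")"))    
}

order_by_totaldmg_per <- function (df){
        # sort rows by share of total damage, largest first;
        # the trailing comma makes this a row selection for any data frame type
        df[order(df$PercentTotalDamage, decreasing = TRUE), ]
}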

set_totaldmg <- function(df, col1, col2){
        df$TotalDamage <- col1+col2
        df$PercentTotalDamage <- round(df$TotalDamage*100/sum(df$TotalDamage),2)
        return(df)
}


set_exp <- function(col_exp){
        # replace magnitude symbols (case-insensitive) with numeric multipliers
        col_exp <- gsub("K","1000",col_exp, ignore.case = TRUE)
        col_exp <- gsub("M","1e6",col_exp, ignore.case = TRUE)
        col_exp <- gsub("B","1e9",col_exp, ignore.case = TRUE)
        col_exp <- as.numeric(col_exp)
        return(col_exp)
}

Packages

Then we load the necessary packages using the “loadpackage” function:

#loading packages
loadpackage("data.table")
loadpackage("R.utils")
loadpackage("dplyr")
loadpackage("ggplot2")
loadpackage("knitr")

Reading data

Now we can download, unzip and read the data. We use the “fread” function because it is faster than read.csv. Don’t forget to set your working directory!

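#load data
filename <- "stormdata.csv.bz2"
filenameCSV <- "stormdata.csv"

# download the archive only once; mode = "wb" keeps the binary file intact on Windows
if (!file.exists(filename)){
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                      filename, mode = "wb")
}

# unzip separately, so a missing CSV is rebuilt even if the archive already exists
if (!file.exists(filenameCSV)){
        bunzip2(filename, filenameCSV, remove = FALSE, skip = FALSE)
}

DataStorm <- fread(filenameCSV, sep=",", header = TRUE)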
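If installing R.utils for the unzip step is not an option, base R can read a bzip2-compressed CSV directly through a connection. The sketch below is only an illustration on a tiny temporary file (the `toy` data frame and the temporary path are made up for the example); on the full storm dataset this route is expected to be noticeably slower than fread.

```r
# Hypothetical stand-in for the real download: a tiny data frame
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(2, 1),
                  stringsAsFactors = FALSE)

# Write it as a bzip2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file through a bzfile() connection,
# so no separate unzip step is needed
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)

print(toy_back)
```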

Data cleaning

According to the database documentation, four variables are of interest:

Population health:
```
FATALITIES
INJURIES
```

Economy:

PROPDMG (property damage)
CROPDMG (crop damage)

The economic variables are stored with a separate magnitude column (PROPDMGEXP and CROPDMGEXP) that gives the unit of the value:

B - billions
M - millions
K - thousands

It seems better to ignore rows with any other symbol in the magnitude columns.

DataStorm <- subset(DataStorm, DataStorm$CROPDMGEXP %in% c("","b","B","k","K","m","M") &
                            DataStorm$PROPDMGEXP %in% c("","b","B","k","K","m","M"))
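To see what this filter keeps and drops, here is a small sketch on made-up rows (the `toy` data frame and its values are hypothetical, not taken from the storm data). It first tabulates the symbols that occur and then applies the same `%in%` condition to both magnitude columns.

```r
# Hypothetical rows with a mix of valid and invalid magnitude symbols
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "TORNADO", "DROUGHT"),
  PROPDMGEXP = c("K", "5", "m", ""),
  CROPDMGEXP = c("", "K", "B", "?"),
  stringsAsFactors = FALSE
)

# Symbols accepted by the filter: empty, billions, thousands, millions
valid <- c("", "b", "B", "k", "K", "m", "M")

# Quick look at which symbols occur in one of the columns
print(table(toy$PROPDMGEXP))

# A row is kept only if both columns contain an accepted symbol
kept <- toy[toy$PROPDMGEXP %in% valid & toy$CROPDMGEXP %in% valid, ]
print(kept$EVTYPE)
```

In this toy example only the FLOOD and TORNADO rows survive: HAIL has "5" as its property symbol and DROUGHT has "?" as its crop symbol.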

Now we can substitute the characters with numeric multipliers.

# rows without a magnitude symbol get a multiplier of 0, so their damage counts as zero
DataStorm$PROPDMGEXP[DataStorm$PROPDMGEXP==""] <- "0"
DataStorm$PROPDMGEXP <- set_exp(DataStorm$PROPDMGEXP)
DataStorm$PROPDMG <- DataStorm$PROPDMGEXP*DataStorm$PROPDMG

DataStorm$CROPDMGEXP[DataStorm$CROPDMGEXP==""] <- "0"
DataStorm$CROPDMGEXP <- set_exp(DataStorm$CROPDMGEXP)
DataStorm$CROPDMG <- DataStorm$CROPDMGEXP*DataStorm$CROPDMG

All economic variables are now expressed in the same unit (dollars), so we can work with them directly.
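As a worked example of the conversion, the sketch below repeats the “set_exp” helper and applies it to a few made-up symbol and value pairs (the vectors `exp_symbols` and `values` are hypothetical). The "0" entry stands for a row whose symbol was empty.

```r
# Same helper as defined in the Functions section
set_exp <- function(col_exp) {
  col_exp <- gsub("K", "1000", col_exp, ignore.case = TRUE)
  col_exp <- gsub("M", "1e6", col_exp, ignore.case = TRUE)
  col_exp <- gsub("B", "1e9", col_exp, ignore.case = TRUE)
  as.numeric(col_exp)
}

# Hypothetical magnitude symbols and raw damage values
exp_symbols <- c("K", "m", "B", "0")
values      <- c(25, 2.5, 1.2, 10)

multipliers <- set_exp(exp_symbols)  # 1000, 1e6, 1e9, 0
damage      <- values * multipliers  # 25000, 2500000, 1200000000, 0

print(data.frame(exp_symbols, values, multipliers, damage))
```

Note that the last row ends up with zero damage, because the multiplier for an empty symbol is 0.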

Data summarising and first plots

First we summarise the dataset by event type.

groupedDF <- DataStorm %>% group_by(EVTYPE)  %>%
        summarise(FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES), 
                  PROPDMG= sum(PROPDMG),CROPDMG= sum(CROPDMG))
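The effect of this group-and-sum step can be illustrated on a few made-up rows. The sketch below uses base R's `aggregate` rather than dplyr so that it runs without extra packages; it is an equivalent for illustration only, and the `toy` data frame is hypothetical.

```r
# Hypothetical event records: several rows per event type
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(1, 3, 0, 2, 0),
  INJURIES   = c(4, 10, 1, 20, 2)
)

# One row per event type, with the columns summed within each type
grouped <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

print(grouped)
# The number of rows equals the number of distinct event types
print(nrow(grouped))
```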

How many different event types are in the dataset?

events_count <- nrow(groupedDF)

There are 981 of them. That is a lot, but we do not need to explore every single type. We are interested only in the “top” cases.

So we should check how the damage is distributed across event types. Pie charts are handy for this (in brackets: the number of event types with non-zero damage of that kind):

#pie diagrams
par(mfrow=c(2,2), mar= c(1,1,1,1))
# keep only event types with non-zero damage in each chart
drawpie(groupedDF$FATALITIES[groupedDF$FATALITIES>0], "Fatalities")
drawpie(groupedDF$INJURIES[groupedDF$INJURIES>0], "Injuries")
drawpie(groupedDF$PROPDMG[groupedDF$PROPDMG>0], "Property damage")
drawpie(groupedDF$CROPDMG[groupedDF$CROPDMG>0], "Crop damage")

Of course, these charts are too small to draw exact conclusions. But we can see that in almost all cases there is a “dominating” event type, whose contribution is almost always at the 40-50% level.

So we can analyse only the “top” events for each case.
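The share of the leading event type can also be checked numerically instead of being read off a pie chart. The helper `top_share` below is a hypothetical addition, not part of the original analysis, and the input numbers are made up; with the real data it could be applied to each column of groupedDF.

```r
# Hypothetical helper: percentage contributed by the largest slice
top_share <- function(x) {
  x <- x[!is.na(x) & x > 0]        # same non-zero filter as the pie charts
  round(100 * max(x) / sum(x), 1)
}

# Made-up totals per event type
toy_fatalities <- c(TORNADO = 50, HEAT = 30, FLOOD = 15, HAIL = 5, FOG = 0)

leader <- names(which.max(toy_fatalities))
share  <- top_share(toy_fatalities)

print(leader)
print(share)
```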

Health plot

We create a new data frame that holds only the “health” variables, restricted to the “top” event types (those above the 99th percentile in fatalities or injuries).

#health
DFHealth <- subset(groupedDF, groupedDF$FATALITIES>quantile(groupedDF$FATALITIES,0.99, na.rm = T)|
                           groupedDF$INJURIES>quantile(groupedDF$INJURIES,0.99,na.rm = T),
                   select=c(EVTYPE, FATALITIES, INJURIES))
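A small sketch of how a 99th-percentile cut-off isolates the top of a skewed distribution, assuming a made-up vector of 100 totals in which most event types cause no damage at all. The vector `damage` is hypothetical and only mimics the shape of the real columns.

```r
# Hypothetical totals for 100 event types: mostly zeros, a few large values
damage <- c(rep(0, 95), 1, 2, 3, 50, 1000)

# 99th percentile with R's default interpolation; lies between 50 and 1000
threshold <- quantile(damage, 0.99, na.rm = TRUE)

# Only values strictly above the threshold are kept
top <- damage[damage > threshold]

print(threshold)
print(top)
```

Because the filter in the analysis combines two such conditions with `|`, an event type is kept if it is in the top 1% for either variable.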

Then we add new variables for the total damage and for each event type’s share of the total damage.

DFHealth <- set_totaldmg(DFHealth, DFHealth$FATALITIES, DFHealth$INJURIES)
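As a worked example, the sketch below repeats the “set_totaldmg” helper and applies it to two made-up rows (the `toy` data frame is hypothetical), so the arithmetic can be followed by hand.

```r
# Same helper as defined in the Functions section
set_totaldmg <- function(df, col1, col2) {
  df$TotalDamage <- col1 + col2
  df$PercentTotalDamage <- round(df$TotalDamage * 100 / sum(df$TotalDamage), 2)
  df
}

# Hypothetical totals for two event types
toy <- data.frame(EVTYPE     = c("TORNADO", "FLOOD"),
                  FATALITIES = c(30, 10),
                  INJURIES   = c(120, 40))

toy <- set_totaldmg(toy, toy$FATALITIES, toy$INJURIES)

# TotalDamage is 150 and 50; PercentTotalDamage is 75 and 25
print(toy)
```

The percentage is relative to the rows passed in. Since DFHealth contains only the top event types, the shares are shares of that subset and not of the whole dataset.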

Let’s make a plot:

ggplot(DFHealth, aes(x=FATALITIES,y= INJURIES))+
        geom_point()+ 
        geom_text(aes(label=ifelse(FATALITIES>1000,as.character(EVTYPE),'')), hjust=0)+
        theme_bw()+
        xlim(c(0,7000))+
        xlab("Fatalities")+
        ylab("Injuries")+
        ggtitle("Population health consequences of different events")

We can see that tornadoes lead by a wide margin. Without doubt, tornadoes are the most harmful events for population health.

Economy plot

Let’s do the same operations with “economy” variables:

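#Economy
# note: both conditions below use the 99th percentile of CROPDMG as the threshold,
# so the property-damage filter is looser than a PROPDMG-based one would be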
DFEconomy <- subset(groupedDF, groupedDF$PROPDMG>quantile(groupedDF$CROPDMG,0.99,na.rm = T)|
                            groupedDF$CROPDMG>quantile(groupedDF$CROPDMG,0.99,na.rm = T),
                    select = c(EVTYPE,PROPDMG,CROPDMG))

#cut values (they are too large)
DFEconomy$PROPDMG <- round(DFEconomy$PROPDMG/1e9,2)
DFEconomy$CROPDMG <- round(DFEconomy$CROPDMG/1e9,2)
DFEconomy <- set_totaldmg(DFEconomy, DFEconomy$PROPDMG, DFEconomy$CROPDMG)

ggplot(DFEconomy, aes(x=PROPDMG,y= CROPDMG))+
        geom_point()+ 
        geom_text(aes(label=ifelse(PROPDMG>50|CROPDMG>6,as.character(EVTYPE),'')), 
                  hjust=0)+
        theme_bw()+
        xlab("Property damage, billions of $")+
        ylab("Crop damage, billions of $")+
        xlim(c(0,160))+
        ggtitle("Economic consequences of different events")

Drought is more dangerous for crops than anything else. For property damage, however, flood is the most destructive event (note that the absolute values are much greater than for crop damage). Hurricanes and tornadoes are very dangerous too.

Results

We have identified the most harmful events. The following tables list the top causes of damage, ordered by their contribution to total damage. For population health:

#Health table

DFHealth <- order_by_totaldmg_per(DFHealth)

kable(DFHealth[DFHealth$PercentTotalDamage>1, ], format = "markdown", 
      col.names = c("Event","Fatalities",
                    "Injuries","Total damage", "Total damage, %"))

Event	Fatalities	Injuries	Total damage	Total damage, %
TORNADO	5630	91321	96951	69.56
EXCESSIVE HEAT	1903	6525	8428	6.05
TSTM WIND	504	6957	7461	5.35
FLOOD	470	6789	7259	5.21
LIGHTNING	816	5230	6046	4.34
HEAT	937	2100	3037	2.18
FLASH FLOOD	978	1777	2755	1.98
ICE STORM	89	1975	2064	1.48
THUNDERSTORM WIND	133	1488	1621	1.16

And for economy:

#Economy table

DFEconomy <- order_by_totaldmg_per(DFEconomy)

kable(DFEconomy[DFEconomy$PercentTotalDamage>1, ], format = "markdown", 
      col.names = c("Event","Property damage $B", "Crop damage $B",
                    "Total damage $B", "Total damage, %"))

Event	Property damage $B	Crop damage $B	Total damage $B	Total damage, %
FLOOD	144.66	5.66	150.32	32.29
HURRICANE/TYPHOON	69.31	2.61	71.92	15.45
TORNADO	56.94	0.36	57.30	12.31
STORM SURGE	43.32	0.00	43.32	9.31
HAIL	15.73	3.00	18.73	4.02
FLASH FLOOD	16.14	1.42	17.56	3.77
DROUGHT	1.05	13.97	15.02	3.23
HURRICANE	11.87	2.74	14.61	3.14
RIVER FLOOD	5.12	5.03	10.15	2.18
ICE STORM	3.94	5.02	8.96	1.92
TROPICAL STORM	7.70	0.68	8.38	1.80
WINTER STORM	6.69	0.03	6.72	1.44
HIGH WIND	5.27	0.64	5.91	1.27
WILDFIRE	4.77	0.30	5.07	1.09
TSTM WIND	4.48	0.55	5.03	1.08