knitr::opts_chunk$set(echo = TRUE)
library(data.table)
library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.3.4     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()

Synopsis

Main Goal:

The main goal of this report is to answer the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

To answer these questions, the storm data from here are used; for more details about the data set, see the documentation here. The data are preprocessed, prepared, and then analyzed to answer the two questions.

To answer the main questions of this document, a date variable is created first; in a second preparation step, the financial impact of each event is calculated in a new variable.

After that, two graphs are used to answer the main questions.

Data Processing

Data set up

This step of the analysis focuses on loading the data set and preparing the data for all further analysis.
The goal is a data set for which only aggregation and question-specific preparation are still needed. All basics should be ready after this step.

Data loading and preprocessing

The data are available as a CSV file and are converted to data.table format.

stormDT<-read.csv("repdata%2Fdata%2FStormData.csv")
setDT(stormDT)
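As a possible alternative to read.csv() followed by setDT(), data.table's own reader fread() can be used. The sketch below is only an illustration: the temporary file and the toyDT name are hypothetical stand-ins and not the storm data file. Note that fread() keeps text columns as character by default, so results can differ from read.csv() on R versions that convert strings to factors.

```r
library(data.table)

# Hypothetical stand-in for the storm CSV: a tiny file written to a temp path.
tmp <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,INJURIES,FATALITIES",
             "TORNADO,15,0",
             "FLOOD,0,1"), tmp)

# fread() returns a data.table directly, so no setDT() call is needed.
toyDT <- fread(tmp)
is.data.table(toyDT)
```

For a large file this usually reads faster than read.csv(), but the rest of the report works the same way with either approach.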

Data preparation and setting up the analysis data set

As part of data preparation, a new date-time column is written.
A new financial impact variable is written as the sum of crop and property damage, each multiplied by its unit indicator to bring everything onto the same scale.
Both steps use data.table's assignment by reference (:=), which is very resource efficient and works smoothly with bigger data sets.

stormDT[, BGN_DATETIME := as.POSIXct(paste(as.Date(BGN_DATE, format = "%m/%d/%Y"), BGN_TIME),
                                     format = "%Y-%m-%d %H%M",
                                     tz = "")]
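To make the date-time step easier to follow, here is a minimal sketch on toy rows. The values in toyDT are made up for illustration, and tz = "UTC" is used only so the result does not depend on the local time zone. The third row is an assumption about what happens when a time is stored in a different layout than "HHMM": it does not match the format and becomes NA.

```r
library(data.table)

# Toy rows (hypothetical values): date as "m/d/Y ..." text, time mostly as "HHMM".
toyDT <- data.table(BGN_DATE = c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "1/6/1996 0:00:00"),
                    BGN_TIME = c("0130", "1600", "12:00:00 AM"))

# as.Date() keeps only the date part; paste() then appends the time string.
toyDT[, BGN_DATETIME := as.POSIXct(paste(as.Date(BGN_DATE, format = "%m/%d/%Y"), BGN_TIME),
                                   format = "%Y-%m-%d %H%M",
                                   tz = "UTC")]

# Count rows whose time did not match the "HHMM" layout.
toyDT[, sum(is.na(BGN_DATETIME))]
```

Counting the NA values this way on the real data is a quick check of how many rows the chosen time format actually covers.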

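# Normalise the unit indicators to lower case; anything other than b, k or m becomes NA.
stormDT[, CROPDMGEXP := tolower(as.character(CROPDMGEXP))]
stormDT[!CROPDMGEXP %in% c("b", "k", "m"), CROPDMGEXP := NA]
stormDT[, PROPDMGEXP := tolower(as.character(PROPDMGEXP))]
stormDT[!PROPDMGEXP %in% c("b", "k", "m"), PROPDMGEXP := NA]
# Financial impact in dollars: k = 1e3, m = 1e6, b = 1e9.
# Rows where either indicator is NA evaluate to NA and are dropped later by sum(..., na.rm = TRUE).
stormDT[, economics := PROPDMG *
              ifelse(PROPDMGEXP == "m", 1e6, ifelse(PROPDMGEXP == "b", 1e9, 1e3)) +
              CROPDMG *
              ifelse(CROPDMGEXP == "m", 1e6, ifelse(CROPDMGEXP == "b", 1e9, 1e3))]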
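The scaling above can also be sketched with a lookup vector instead of the nested ifelse(). The example uses made-up toy values; unit_mult, dmg() and economics_alt are hypothetical names introduced here. The first calculation gives the same arithmetic as above, including the NA result when one indicator is missing. The second is only one possible alternative, under the assumption that a component with a missing indicator should count as zero; it is not the method used in this report.

```r
library(data.table)

# Toy rows (hypothetical values); the third row has a missing crop indicator.
toyDT <- data.table(PROPDMG    = c(25, 2.5, 10),
                    PROPDMGEXP = c("k", "m", "k"),
                    CROPDMG    = c(0, 1, 0),
                    CROPDMGEXP = c("k", "k", NA))

# Hypothetical lookup vector: names are the unit indicators, values the multipliers.
unit_mult <- c(k = 1e3, m = 1e6, b = 1e9)

# Same arithmetic as the nested ifelse(): an NA indicator makes the whole sum NA.
toyDT[, economics := PROPDMG * unname(unit_mult[PROPDMGEXP]) +
                     CROPDMG * unname(unit_mult[CROPDMGEXP])]

# Assumption: a component with a missing indicator contributes zero.
dmg <- function(value, unit) {
  out <- value * unname(unit_mult[unit])
  out[is.na(out)] <- 0
  out
}
toyDT[, economics_alt := dmg(PROPDMG, PROPDMGEXP) + dmg(CROPDMG, CROPDMGEXP)]
toyDT
```

Comparing the two columns shows how much damage is excluded when one of the two indicators is missing.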

Results

Answering main Questions

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Looking at both injuries and fatalities across all US states

plotDT <-
      stormDT[,
              .(
                    Injuries = sum(INJURIES, na.rm = TRUE),
                    Fatalities = sum(FATALITIES, na.rm = TRUE)
              ),
              by = EVTYPE][
                    order(Injuries, decreasing = TRUE)][1:10]
# reorder factor levels by number of injuries
plotDT$EVTYPE <- factor(plotDT$EVTYPE, levels = plotDT[, as.character(EVTYPE)], labels = plotDT[, as.character(EVTYPE)])
plotDT <- melt(plotDT, id.vars = "EVTYPE")
ggplot(plotDT)+
      geom_bar(aes(x=EVTYPE, fill=variable, y=value),
               stat = "identity",
               position = "stack")+
      theme_bw()+
      scale_fill_brewer(palette = 7,
                        type = "div")+
      scale_y_continuous(labels = function(x)format(x,big.mark = ","))+
      labs(title="Top 10 events by injuries and fatalities", y = "# cases with impact on health", x="Events", fill = "Case type")
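The preparation behind this plot (aggregate per event type, rank, keep the top rows, fix the factor order, reshape to long format) can be followed on a small example. The numbers in toyDT are invented for illustration only, and topDT and longDT are hypothetical names.

```r
library(data.table)

# Toy data (hypothetical values), several rows per event type.
toyDT <- data.table(EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                    INJURIES   = c(10, 2, 5, 1, 0),
                    FATALITIES = c(1, 0, 2, 3, 1))

# Sum per event type, sort by injuries (descending) and keep the top 2.
topDT <- toyDT[, .(Injuries = sum(INJURIES), Fatalities = sum(FATALITIES)), by = EVTYPE
               ][order(-Injuries)
               ][1:2]

# Fix the level order so a plot axis follows the ranking, not the alphabet.
topDT[, EVTYPE := factor(EVTYPE, levels = EVTYPE)]

# Long format: one row per event type and case type, as needed for a stacked bar chart.
longDT <- melt(topDT, id.vars = "EVTYPE")
longDT
```

Because the ranking uses injuries only, an event type with many fatalities but few injuries (HEAT in the toy data) can fall outside the top rows.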

The graph makes it quite obvious that tornadoes cause by far the most injuries and fatalities. No other event type comes close.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Economic consequences are defined here as financial impact, i.e. the sum of property and crop damage.

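# Total financial impact per event type in $ billions;
# V1 is the default name data.table gives the unnamed aggregate.
plotDT <-
      stormDT[, sum(economics, na.rm = TRUE) / 1e9, by = EVTYPE
        ][order(V1, decreasing = TRUE)
          ][1:10]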

plotDT$EVTYPE<-factor(plotDT$EVTYPE, 
                      levels = plotDT$EVTYPE, 
                      labels = plotDT$EVTYPE )

ggplot(plotDT)+
      geom_bar(aes(x=EVTYPE, y=V1),
               stat = "identity",
               position = "stack")+
      labs(title="Top 10 events by financial impact", y="Financial impact (in $ billions)", x="Events")

Floods have the highest financial impact, far ahead of second-place hurricanes/typhoons. Tornadoes, which ranked first for fatalities and injuries, are now in third place.

Course Project 2: Looking at Storm data