Synopsis

This report presents the worst weather events across the United States. It is considering their effects in harmful with respect to population health and what this have the greatest economic consequences.

To achieve that, we obtain the storm data from U.S. National Oceanic and Atmospheric Administration (NOAA). We process these data by tidying, in special the event types that have too much data quality issues. We considering a subset of the data since 1996, when we have all the event types recorded in the database, sum the fatalities with injuries to obtain the total casualties and calculate the damage cost consider the magnitude and property damage cost and crop damage cost.

Finally, our results indicate that convection events category are the most harmful with respect to population health, in special tornado, that belong this category, is the worst event type. On the other hand, we need pay more attention with first place of the mortalities caused by excessive heat event type. Also, the tropical cyclones events category have the greatest economic consequences considering the record in 2006, in special the hurricane event types that belong this category, is the greatest ones followed by flood.

Prepare the Environment

Throughout this report you can always find the code that I used to generate my output presents here. When writing code chunks in the R markdown document, always use echo = TRUE so that someone else will be able to read the code. This assignment will be evaluated via peer assessment so it’s essential that my peer evaluators be able to review my code and my analysis together..

First, we set echo equal a TRUE and results equal a ‘hold’ as global options for this document.

library(knitr)
## Warning: package 'knitr' was built under R version 3.1.1
opts_chunk$set(echo = TRUE, results = 'hold')

Next, we load the libraries used in our analysis: ### Load data.table, xtable and ggplot2 libraries

library(data.table)
library(xtable)
library(ggplot2) # we shall use ggplot2 for plotting figures
library(gridExtra) 
## Warning: package 'gridExtra' was built under R version 3.1.1
## Loading required package: grid

We don’t use any random numbers or functions that use it, so we don’t need set a seed.

Next, we show our system environment, maybe it is useful for reproducible proposals:

sessionInfo()
## R version 3.1.0 Patched (2014-04-10 r65396)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
## [5] LC_TIME=Portuguese_Brazil.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] gridExtra_0.9.1  ggplot2_1.0.0    xtable_1.7-3     data.table_1.9.2
## [5] knitr_1.6       
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] gtable_0.1.2     htmltools_0.2.4  MASS_7.3-31      munsell_0.4.2   
##  [9] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4    
## [13] rmarkdown_0.2.46 scales_0.2.4     stringr_0.6.2    tools_3.1.0     
## [17] yaml_2.1.13

Data Processing

This assignment makes use of Storm Data in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Loading

This assignment instructions request to show any code that is needed to loading and preprocessing the data, like to:

  1. Load the data (i.e. > read.csv())
  2. Process/transform the data (if necessary) into a format suitable for your analysis

Function definitions and load libraries code segment

First we define a function to check if the files exists in the path defined by file_path. If don’t exists stops execution of the current expression, and executes an error action.

check_file_exist <- function(file_path) 
{
        if (!file.exists(file_path))
                stop("The ", file_path, " not found!") else TRUE 
}

Next, we use data set and data_dir to define the file_path, call the check_file_exist function to check, if the file exist send a message to user for waiting the load of file. Finally returned with data set load to data variable.

load_data <- function(data_dir , fileURL, fileSource) 
{
        # Dataset check and load 
        
        file_path <- paste(data_dir, "/", fileSource , sep="")
        
        if (!file.exists(file_path)) {
                message(paste("Please Wait! Download...", fileURL, "..."));
                download.file(fileURL, destfile=source_path);
        } 

        data <- read.csv(bzfile(file_path),
                         header=TRUE,  na.strings=c("NA",""))
        
}

Assign the directory of this analysis

Maybe you need to change this data_dir variable to yours peeress directory, because the line code that ask you inform where the data directory is find, use readline function, not function at markdown documents.

data_dir <- "C:/Users/Marcelo/NOAA_Analysis/Data";

Here, we initiate the main variables and run data load and preparation process for the activate data.

fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" 
fileSource <-"repdata-data-StormData.csv.bz2"

raw_data <- load_data(data_dir , fileURL, fileSource) 

At last, we define some functions to simplify the actions for multiple plots in one graphic.

# define function to create multi-plot setup (nrow, ncol)
vp.setup <- function(x,y){

# create a new layout with grid
grid.newpage()

# define viewports and assign it to grid layout
pushViewport(viewport(layout = grid.layout(x,y)))
}

# define function to easily access layout (row, col)
vp.layout <- function(x,y){
viewport(layout.pos.row=x, layout.pos.col=y)
}

Data Exploration

Now, we can proceed with the data pre-examination, first we examine the str:

str(raw_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Then we chech its summary.

xt <- xtable(summary(raw_data))
print(xt, type = "html")
STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
1 Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163 CST :547493 Min. : 0 JEFFERSON : 7840 TX : 83728 HAIL :288661 Min. : 0 N : 86752 COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802 Min. :0 Mode:logical Min. : 0 N : 28082 COUNTYWIDE : 19731 Min. : 0.0 Min. : 0 Min. :0 Min. : 0 Min. : 0 Min. : 0.0 Min. : 0 K :424665 Min. : 0.0 K :281832 OUN : 17393 TEXAS, North : 12193 :205988 Min. : 0 Min. :-14451 Min. : 0 Min. :-14455 : 24013 Min. : 1
2 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350 EST :245558 1st Qu.: 31 WASHINGTON: 7603 KS : 53440 TSTM WIND :219940 1st Qu.: 0 W : 38446 Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314 1st Qu.:0 NA’s:902297 1st Qu.: 0 S : 22510 SOUTH PORTION : 833 1st Qu.: 0.0 1st Qu.: 0 1st Qu.:0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0 M : 11330 1st Qu.: 0.0 M : 1994 JAN : 13889 ARKANSAS, Central and North Central: 11738 GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Trees down. : 1110 1st Qu.:225575
3 Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261 MST : 68390 Median : 75 JACKSON : 6660 OK : 46802 THUNDERSTORM WIND: 82563 Median : 0 S : 37558 SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104 Median :0 Median : 0 W : 20119 NORTH PORTION : 780 Median : 0.0 Median : 0 Median :1 Median : 50 Median : 0 Median : 0.0 Median : 0 0 : 216 Median : 0.0 k : 21 LWX : 13174 IOWA, Central : 11345 GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592 Median :3540 Median : 8707 Median : 0 Median : 0 Several trees were blown down. : 568 Median :451149
4 Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891 PST : 28302 Mean :101 FRANKLIN : 6256 MO : 35648 TORNADO : 60652 Mean : 1 E : 33178 SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483 Mean :0 Mean : 1 E : 20047 CENTRAL PORTION: 617 Mean : 0.2 Mean : 8 Mean :1 Mean : 47 Mean : 0 Mean : 0.2 Mean : 12 B : 40 Mean : 1.5 0 : 19 PHI : 12551 KANSAS, Southwest : 11212 JEFFERSON - JEFFERSON : 303 Mean :2875 Mean : 6940 Mean :1452 Mean : 3509 Trees were downed. : 446 Mean :451149
5 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703 AST : 6360 3rd Qu.:131 LINCOLN : 5937 IA : 31069 FLASH FLOOD : 54277 3rd Qu.: 1 NW : 24041 NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184 3rd Qu.:0 3rd Qu.: 0 NE : 14606 SPRINGFIELD : 575 3rd Qu.: 0.0 3rd Qu.: 0 3rd Qu.:1 3rd Qu.: 75 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.: 0 5 : 28 3rd Qu.: 0.0 B : 9 TSA : 12483 GEORGIA, North and Central : 11120 MADISON - MADISON : 302 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
Large trees and power lines were blown down.
432
3rd Qu.:676723
6 Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700 HST : 2563 Max. :873 (Other) :866412 NE : 30271 FLOOD : 25326 Max. :3749 (Other):134990 (Other) :591444 (Other) :653450 (Other) :622432 Max. :0 Max. :925 (Other): 72096 (Other) :380536 Max. :2315.0 Max. :4400 Max. :5 Max. :22000 Max. :583 Max. :1700.0 Max. :5000 (Other): 84 Max. :990.0 (Other): 9 (Other):690738 (Other) :595920 (Other) :100444 Max. :9706 Max. : 17124 Max. :9706 Max. :106220 (Other) :588295 Max. :902297
7 (Other) :895866 (Other) :857229 (Other): 3631 NA’s : 1589 (Other):621339 (Other) :170878 NA’s :547332 NA’s :287743 NA’s :243411 NA’s :238978 NA’s :724837 NA’s :499225 NA’s :843563 NA’s :465934 NA’s :618413 NA’s :142069 NA’s :248769 NA’s :594029 NA’s :47 NA’s :40 NA’s :287433

Since our focus is in the event type, we checked all available values in EVTYPE and comparing to the event names defined on Page 6 of the National Weather Service Storm Data Documentation. We can see that 985 unique values are pretty messy, and we reproduce below only firsts 25th names in order for you can observe this issue too.

uniqevtype <- unique(raw_data$EVTYPE)
length(uniqevtype) # Total events types unique values
head(sort(uniqevtype), n = 25)
## [1] 985
##  [1]    HIGH SURF ADVISORY        COASTAL FLOOD             
##  [3]  FLASH FLOOD                 LIGHTNING                 
##  [5]  TSTM WIND                   TSTM WIND (G45)           
##  [7]  WATERSPOUT                  WIND                      
##  [9] ?                           ABNORMAL WARMTH            
## [11] ABNORMALLY DRY              ABNORMALLY WET             
## [13] ACCUMULATED SNOWFALL        AGRICULTURAL FREEZE        
## [15] APACHE COUNTY               ASTRONOMICAL HIGH TIDE     
## [17] ASTRONOMICAL LOW TIDE       AVALANCE                   
## [19] AVALANCHE                   BEACH EROSIN               
## [21] Beach Erosion               BEACH EROSION              
## [23] BEACH EROSION/COASTAL FLOOD BEACH FLOOD                
## [25] BELOW NORMAL PRECIPITATION 
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

Since we need consider the magnitude for our cost, we need check how propdmgexp and cropdmgexp are filled.

unique(raw_data$PROPDMGEXP)
unique(raw_data$CROPDMGEXP)
##  [1] K    M    <NA> B    m    +    0    5    6    ?    4    2    3    h   
## [15] 7    H    -    1    8   
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## [1] <NA> M    K    m    B    ?    0    k    2   
## Levels: ? 0 2 B k K m M

Data Preparation

Here we prossed with all data preparation. Fist we assign the data to a data frame variable named tidy, rename the columns names to lowercase for ease of coding and coerce the bgn_date to date class.

tidy <- raw_data
colnames(tidy) <- tolower(names(tidy))
tidy$bgn_date <- as.Date(tidy$bgn_date, format="%m/%e/%Y %H:%M:%S")

And then, we tidied the events by correcting spelling, remove unuseful characters (special characters, excessive spaces…), simplify names (e.g. remove some adjectives) and change event types to lowercase for easy handle.

# Replace 'and' or '&' with space 
tidy$evtype <- gsub("(\\sAND(\\s|$))|(\\s&\\s)", " ", tidy$evtype) 

# Remove leading and trailing space 
tidy$evtype <- gsub("^\\s+|\\s+$", "", tidy$evtype) 

# Remove multiple spaces 
tidy$evtype <- gsub("\\s{2,}", " ", tidy$evtype) 

# Lowercase Events Types
tidy$evtype <- tolower(tidy[, "evtype"])

# Spelling correction simplify names
tidy$evtype <- gsub("\\s{0,1}-\\s", "/", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\sand[[:alpha:]]{1,}", "/", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\sch([^[[:alpha:]]]{0,}|$)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\schi([^[[:alpha:]]]{0,}|$)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\schil($|\\s)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\ssno($|\\s)", " snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("avalance", "avalanche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("clou$", "cloud", tidy$evtype, ignore.case = TRUE) 
tidy$evtype <- gsub("drie", "dry", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("cstl", "coastal", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("erosin", "erosion", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("fld", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("floodin", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("flooding", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("freeze", "freezing", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("high wind (g40)", "high wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("ligntning", "lightning", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("micoburst", "microburst", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("non(-|\\s)[[:alpha:]]{1,}\\s(.*)", "\\2", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("precipatation", "precipitation", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("rip currents", "rip current", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("strm", "stream", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tornadoes", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("torndao", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tstm", "thunderstorm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tunderstorm", "thunderstorm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("typhoo", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("vog", "fog", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("waterspouts", "waterspout", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wauseon", "", tidy$evtype, ignore.case = TRUE) 
tidy$evtype <- gsub("wayterspout", "waterspout", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wi$|wi[^ndl]|win[^d]", "wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("winder", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wintery", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wintry", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wnd", "wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(hurricane|typhoon).*", "hurricane", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(th?u?n?d?ee?r?|tstm).*", "thunderstorm wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(tornado|torndao).*", "tornado", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("avalance.*", "avalanche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("blizzard.*", "blizzard", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("coastal.*(flood|surge).*", "coastal flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("cold.*", "cold/wind chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("dust.*devel.*", "dust devil", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("dust.*storm.*", "dust storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("excessive.*heat.*", "excessive heat", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("extreme.*(code|wind).*", "extreme code/wind chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("flash.*flood.*|.*flood.*flash.*", "flash flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("funnel.*(cloud)*.*", "funnel cloud", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("hail.*", "hail", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heat.*", "heat", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heavy.*(rain|shower).*", "heavy rain", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heavy.*snow.*", "heavy snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("high.*surf.*", "high surf", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("high.*wind.*", "high wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("ice.*storm.*", "ice storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("lake.*effect.*snow.*", "lake-effect snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("lig(h|n)tn?ing.*", "lightning", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("rip.*current.*", "rip current", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("seiche.*", "seiche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("storm.*", "storm surge/tide", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("tropical.*storm.*", "tropical storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("volcanic.*", "volcanic ash", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("wild.*fire.*", "wildfire", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("summary.*", "summary", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*flood.*|.*floooding.*)", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(sleet.*|.*sleet.*|.*snow.*rain.*|.*rain.*snow.*)", "sleet", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*free.*|.*frost.*)", "frost/freezing", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(icy roads|.*snow.*|.*ice.*)", "ice/snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*thunderstorm.*", "thunderstorm wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*blizzard.*", "blizzard", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(hurricane|gustnado)", "hurricane", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*seas.*|high tides|blow-out tides|blow-out tide)", "heavy seas/tides ", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*erosion.*", "costal erosion", tidy$evtype, ignore.case = TRUE) 
tidy$evtype <- sub(".*whirlwind*", "whirlwind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*landspout*", "landspout", tidy$evtype, ignore.case = TRUE)

Since we continue with a huge numbers of events types, probable we have better results if we group them. For this, we can see that National Climatic Data Center Storm Events FAQ mentioned that annual summary reports (see a eg at http://www.ncdc.noaa.gov/oa/climate/sd/annsum2009.pdf), This report groups the natural disaster events into seven category events in the following way:
- convection: lightning, tornado, thunderstorm wind and hail;
- extreme temperature: cold and heat; - flood: flash flood and river flood;
- marine: coastal storm, tsunami and rip current;
- tropical cyclones: tropical storm/hurricane;
- winter: winter storm, ice and avalance;
- other: drought, dust storm, dust devil, rain, fog, high wind, waterspout, fire weather mud slide, volcanic ash and miscellaneous.

We note that some events types are not mentioned and appear pertain to one of these first six categories. Since we note that we have much quality problems with the input, we made a second input review and reset these categories for it below:
- convection: lightning, tornado, thunderstorm, whirlwind, landspout and hail;
- extreme temperature: cold and heat;
- flood: flash flood and river flood;
- marine: coastal storm, tsunami and rip current;
- tropical cyclones: tropical storm/hurricane;
- winter: winter storm, ice, snow, sleet, frost/freezing, blizzard and avalance;
- other: all remaning events types.

Finaly, we include this group categories of events into our tidy dataset.

# create a new column for group events
tidy$evgroup <- NA

# 1) select convection group events and assign it to evgroup column.
tidy[grepl("lightning", tidy$evtype) | grepl("tornado", tidy$evtype) | grepl("thunderstorm", tidy$evtype) | grepl("whirlwind", tidy$evtype) | grepl("landspout", tidy$evtype) | grepl("hail", tidy$evtype), c("evgroup")] <- "convection"

# 2) select ext. temp group events and assign it to evgroup column.
tidy[grepl("cold", tidy$evtype) | grepl("heat", tidy$evtype), c("evgroup")] <- "ext. temp"

# 3) select flood group events and assign it to evgroup column.
tidy[grepl("flash flood", tidy$evtype) | grepl("flood", tidy$evtype) |grepl("river flood", tidy$evtype), c("evgroup")] <- "flood"

# 4) select marine group events and assign it to evgroup column.
tidy[grepl("coastal storm", tidy$evtype) | grepl("tsunami", tidy$evtype) | grepl("rip current", tidy$evtype), c("evgroup")] <- "marine"

# 5) select tropical cyclones group event and assign it to evgroup column.
tidy[grepl("tropical storm", tidy$evtype) | grepl("hurricane", tidy$evtype), c("evgroup")] <- "trop. cyclones"

# 6) select winter group events and assign it to evgroup column.
tidy[grepl("winter storm", tidy$evtype) | grepl("ice", tidy$evtype) | grepl("sleet", tidy$evtype) | grepl("frost/freezing", tidy$evtype) | grepl("ice/snow", tidy$evtype) | grepl("blizzard", tidy$evtype) | grepl("avalanche", tidy$evtype), c("evgroup")] <- "winter"

# 7) select other group events and assign it to evgroup column.
tidy[is.na(tidy$evgroup), "evgroup"] <- "other"

tidy <- tidy[tidy$bgn_date > as.Date("01/01/1996 0:00:00", "%m/%e/%Y %H:%M:%S"), ]

In a reviewed, we compare the annual summaries report with our data, and find at 2006 that total cost of Flash Flood and River Flood is 2,136.6 million and 1,726.3 million, but we find an entry in billions into our data so, we must replace the “b” by “m” on this record. See below this issue and the correction code:

# Show the events which input the property damage cost wrong in billion instead in million
tidy[grepl("115", tidy$propdmg) & grepl("32.5", tidy$cropdmg), c(2,8,25,26,27,28)]

# replace 'b' by 'm'
tidy[grepl("115", tidy$propdmg) & grepl("32.5", tidy$cropdmg) & 
             grepl("B", tidy$propdmgexp), c("propdmgexp")] <- "M"
##          bgn_date evtype propdmg propdmgexp cropdmg cropdmgexp
## 567251 2005-12-31  flood     115          M    32.5          M
## 605953 2006-01-01  flood     115          B    32.5          M

In the examination the metadata documentation (see: http://thmp.info/metadata/other-storms/Other%20Storm%20Events%201955-2003.html), we see that Prop. Damage Exp.: expressed in dollar unit type, so considered that k is 1,000, m is 1,000,000, b is 1,000,000,000 and others values is 1**.

We create the function below to apply this treatment at propdmgexp and cropdmgexp.

get_exponent <- function(PROPdmgEXP) {
        if (is.na(PROPdmgEXP))  exponent <- 1 
        else {
                exp <- toupper(PROPdmgEXP);
                if (exp == "K") { exponent <- 1000 }
                else if (exp == "M") { exponent <- 1000000 }
                else if (exp == "B") { exponent <- 1000000000 }
                else { exponent <- 1 }
        }
        exponent
}

So the total damage cost is the sum of the property damage cost, resulting from propdmg times propdmgexp, and crop damage cost, resulting from cropdmg times cropdmgexp.

tidy$damage <- with(tidy, propdmg * sapply(propdmgexp, get_exponent)) +
        with(tidy, cropdmg * sapply(cropdmgexp, get_exponent))

Next, we remove unuseful events from the data set to reduce the date size for easier handel, the reason for this is if an event had no fatalities and no injured and no crop or property damage, than for the purpose of this analysis it is useless.

# subset observations which have fatalities or injured or Damages
tidy <- subset(tidy, injuries > 0 | fatalities > 0 | damage > 0)

Also, we compute the total casualties and save it in a new variable called casualties.

tidy$casualties <- tidy$fatalities + tidy$injuries

Finally, we subset the columns which are useful for this analysis.

tidy <- tidy[tidy$evtype!="summary", c("bgn_date", "evgroup", "evtype", "fatalities", "injuries", "casualties", "damage")]

Results

Summary Count Events Group

We initiate with check the summary count of event groups table to see which event group have most observations.

xt <- table(tidy$evgroup)
xt
## 
##     convection      ext. temp          flood         marine          other 
##         153811            965          29519            621          13036 
## trop. cyclones         winter 
##            609           2752

From the table above, we can see the convection event category, which includes lightning, tornado, thunderstorm wind, hail, occurs 153,811, it is most frequncy compare to other event group and it is 5.21 times of second category flood.

Check worse Weather Event Group Types by Casualties and Damage Cost

We proced with focus analysis in the weather category events to identify which are the worse ones from a perspective of Casualties and Damage Cost. For that, we proced with an aggregation of the data by evgroup.

For plotting needs, we create two new datasets, both ordinate decrescent and also reset the factor of evgroup according these order.

### Aggregating data by weather category events
aggr_grp <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ evgroup, data = tidy, FUN=sum, na.rm=TRUE)

worse_casualties <- aggr_grp[with(aggr_grp, order(-casualties)),]
worse_casualties$evgroup <-factor(worse_casualties$evgroup, levels=worse_casualties[order(worse_casualties$casualties), "evgroup"])

worse_damage <- aggr_grp[with(aggr_grp, order(-damage)),];
worse_damage$evgroup <-factor(worse_damage$evgroup, levels=worse_damage[order(worse_damage$damage), "evgroup"]);

aggr <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ bgn_date + evgroup, data = tidy[tidy$bgn_date > as.Date("01/11/2006 0:00:00", "%m/%e/%Y %H:%M:%S"), ], FUN=sum, na.rm=TRUE);

aggr2 <- aggregate(damage ~ evgroup, data = aggr, FUN=sum, na.rm=TRUE);

Figure 1: Weather Category Events Across US:

pd1 <-  ggplot(worse_damage, aes(x=evgroup, y=damage)) +
        ggtitle("Economic Impact (1996 to 2011)") +
        xlab("") + ylab("damage Cost") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        geom_bar(stat="identity") +  coord_flip()

pc1 <-  ggplot(worse_casualties, aes(x=evgroup, y=casualties)) +
        xlab("Weather Category Events") +  ylab("Casualties (fatalities + injuries)") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        ggtitle("Causing Most Casualties (1996 to 2011)") +
        geom_bar(stat="identity") + coord_flip()


pt1 <-  qplot(bgn_date, damage, data=aggr, geom="line", group=evgroup, colour=evgroup, position="identity") +
        ggtitle("Weather Event Group Damage Cost (11/06 to 11/2011) Across US") +
        theme(text = element_text(size=8)) 
              

par(mfrow=c(2,2),mar=c(5.1,4.1,4.1,2.1), oma=c(0,0,0,0))

vp.setup(2,2)

print(pt1, vp=vp.layout(1, 1:2))
print(pd1, vp=vp.layout(2,1))
print(pc1, vp=vp.layout(2,2))

plot of chunk plot_worse_events_group

Figure 1: Weather Category Events Across US


From this graphic above we can see that convection represents the worst category from casualties perspective is convection with 35,311. It is 3.58 times of second category extreme temp, but if you consider only the fatalities see that difference decrease to 1.31 times of extreme temp.

From the total damage cost perspective, we can see that tropical cyclones events have the greatest economic consequences with 94.8 billion dollars. This largely due to the catastrophic events of the first half of 2006 as the New Orleans caused by Katrina. Note that this events are really a record. If we only consider the last 5 year, see the first line graph above, we note that convection and flood take first place with 29.5 and 26.3 billion dollars respectively.

Check worse Weather Event Types by Damage Cost, Casualties and Fatalities

Well, the other category appears like a big bag of cats, since it appears with extreme importance to damage cost, it is interesting make other analysis to check the worst weather events from its types.

So, we proced with focus analysis in the weather events types to identify which are the 10th worse ones from a perspective of Damage Cost, Casualties and fatalities. For that, we proced with an aggregation of the data by evtype.
For plotting needs, we create two new datasets, both ordinate decrescent and also reset the factor of evtype according these order.

### Aggregating data by event types
aggr_type <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ evtype, data = tidy, FUN=sum, na.rm=TRUE)

### Getting the top ten Weather Event Types of Casualties and Damage Cost
top_damage <- aggr_type[with(aggr_type, order(-damage)),][1:10,]
top_damage$evtype <-factor(top_damage$evtype, levels=top_damage[order(top_damage$damage), "evtype"])

top_casualties <- aggr_type[with(aggr_type, order(-casualties)),][1:10,]
top_casualties$evtype <-factor(top_casualties$evtype, levels=top_casualties[order(top_casualties$casualties), "evtype"])

top_fatalities <- aggr_type[with(aggr_type, order(-fatalities)),][1:10,]
top_fatalities$evtype <-factor(top_fatalities$evtype, levels=top_fatalities[order(top_fatalities$fatalities), "evtype"])

Figure 2: The 10th Worst Weather Event Types Across US:

pd2 <- ggplot(top_damage, aes(x=evtype, y=damage)) +
        ggtitle("Having Greatest Economic Impact") +
        xlab("Weather Event Type") +  ylab("damage Cost") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        geom_bar(stat="identity") +  coord_flip()

pc2 <- ggplot(top_casualties, aes(x=evtype, y=casualties)) +
        xlab("") + ylab("Casualties (fatalities + injuries)") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        ggtitle("Causing Most Casualties") +
        geom_bar(stat="identity") + coord_flip()

pf2 <- ggplot(top_fatalities, aes(x=evtype, y=fatalities)) +
        ggtitle("Causing Most Fatalities") +
        xlab("") + ylab("Fatalities") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        geom_bar(stat="identity") +  coord_flip()

par(mar=c(5.1,4.1,4.1,2.1), oma=c(0,0,0,0))

grid.arrange(pd2, pc2, pf2, nrow = 1, ncol=3) 
plot of chunk plot_worse_events_types
Figure 2: The 10th Worst Weather Event Types Across US


From this graphic above we can see that tornado represents the worst eventtype from casualties perspective is tornado with 22,183. It is 2.25 times of second event type flood. On the other hand, if you consider only the fatalities, the tornados surge as the second and excessive heat takes the worst place with 18.9% more deaths than tornados.

From the total damage cost perspective, we can see that hurricane events have the greatest economic consequences with 86.5 billion dollars.

Conclusion:

From casualties perspective, we can see that:

From the total damage cost perspective, we can see that:

So, tornado is the most health harmful, but excessive heat cause the greatest mortalities and hurricane have the greatest economic consequences

The Worst Weather Events across US to the most health harmful is tornado, but excessive heat cause the greatest mortalities and hurricane have the greatest economic consequences considering the record in 2006.


### Publish references:### Reproducible Research: Peer Assessment 2 from Coursera by Johns Hopkins University
Online publish in: http://rpubs.com/mgmarques/NOAA_Analysis
github repo with RMarkdown source code: https://github.com/mgmarques/NOAA_Analysis