Tornado is the most health harmful Weather Event Type, but Excessive Heat cause the greatest mortalities and Hurricane has the greatest economic impacts.

Synopsis
Prepare the Environment
Data Processing
Results
Conclusion:

Synopsis

This report presents the worst weather events across the United States. It is considering their effects in harmful with respect to population health and what this have the greatest economic consequences.

To achieve that, we obtain the storm data from U.S. National Oceanic and Atmospheric Administration (NOAA). We process these data by tidying, in special the event types that have too much data quality issues. We considering a subset of the data since 1996, when we have all the event types recorded in the database, sum the fatalities with injuries to obtain the total casualties and calculate the damage cost consider the magnitude and property damage cost and crop damage cost.

Finally, our results indicate that convection events category are the most harmful with respect to population health, in special tornado, that belong this category, is the worst event type. On the other hand, we need pay more attention with first place of the mortalities caused by excessive heat event type. Also, the tropical cyclones events category have the greatest economic consequences considering the record in 2006, in special the hurricane event types that belong this category, is the greatest ones followed by flood.

Prepare the Environment

Throughout this report you can always find the code that I used to generate my output presents here. When writing code chunks in the R markdown document, always use echo = TRUE so that someone else will be able to read the code. This assignment will be evaluated via peer assessment so it’s essential that my peer evaluators be able to review my code and my analysis together..

First, we set echo equal a TRUE and results equal a ‘hold’ as global options for this document.

library(knitr)

## Warning: package 'knitr' was built under R version 3.1.1

opts_chunk$set(echo = TRUE, results = 'hold')

Next, we load the libraries used in our analysis: ### Load data.table, xtable and ggplot2 libraries

library(data.table)
library(xtable)
library(ggplot2) # we shall use ggplot2 for plotting figures
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.1.1

## Loading required package: grid

We don’t use any random numbers or functions that use it, so we don’t need set a seed.

Next, we show our system environment, maybe it is useful for reproducible proposals:

sessionInfo()

## R version 3.1.0 Patched (2014-04-10 r65396)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
## [5] LC_TIME=Portuguese_Brazil.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] gridExtra_0.9.1  ggplot2_1.0.0    xtable_1.7-3     data.table_1.9.2
## [5] knitr_1.6       
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] gtable_0.1.2     htmltools_0.2.4  MASS_7.3-31      munsell_0.4.2   
##  [9] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4    
## [13] rmarkdown_0.2.46 scales_0.2.4     stringr_0.6.2    tools_3.1.0     
## [17] yaml_2.1.13

Data Processing

This assignment makes use of Storm Data in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation.
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Loading

This assignment instructions request to show any code that is needed to loading and preprocessing the data, like to:

Load the data (i.e. > read.csv())
Process/transform the data (if necessary) into a format suitable for your analysis

Function definitions and load libraries code segment

First we define a function to check if the files exists in the path defined by file_path. If don’t exists stops execution of the current expression, and executes an error action.

check_file_exist <- function(file_path) 
{
        if (!file.exists(file_path))
                stop("The ", file_path, " not found!") else TRUE 
}

Next, we use data set and data_dir to define the file_path, call the check_file_exist function to check, if the file exist send a message to user for waiting the load of file. Finally returned with data set load to data variable.

load_data <- function(data_dir , fileURL, fileSource) 
{
        # Dataset check and load 
        
        file_path <- paste(data_dir, "/", fileSource , sep="")
        
        if (!file.exists(file_path)) {
                message(paste("Please Wait! Download...", fileURL, "..."));
                download.file(fileURL, destfile=source_path);
        } 

        data <- read.csv(bzfile(file_path),
                         header=TRUE,  na.strings=c("NA",""))
        
}

Assign the directory of this analysis

Maybe you need to change this data_dir variable to yours peeress directory, because the line code that ask you inform where the data directory is find, use readline function, not function at markdown documents.

data_dir <- "C:/Users/Marcelo/NOAA_Analysis/Data";

Here, we initiate the main variables and run data load and preparation process for the activate data.

fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" 
fileSource <-"repdata-data-StormData.csv.bz2"

raw_data <- load_data(data_dir , fileURL, fileSource)

At last, we define some functions to simplify the actions for multiple plots in one graphic.

# define function to create multi-plot setup (nrow, ncol)
vp.setup <- function(x,y){

# create a new layout with grid
grid.newpage()

# define viewports and assign it to grid layout
pushViewport(viewport(layout = grid.layout(x,y)))
}

# define function to easily access layout (row, col)
vp.layout <- function(x,y){
viewport(layout.pos.row=x, layout.pos.col=y)
}

Data Exploration

Now, we can proceed with the data pre-examination, first we examine the str:

str(raw_data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Then we chech its summary.

xt <- xtable(summary(raw_data))
print(xt, type = "html")

	STATE__	BGN_DATE	BGN_TIME	TIME_ZONE	COUNTY	COUNTYNAME	STATE	EVTYPE	BGN_RANGE	BGN_AZI	BGN_LOCATI	END_DATE	END_TIME	COUNTY_END	COUNTYENDN	END_RANGE	END_AZI	END_LOCATI	LENGTH	WIDTH	F	MAG	FATALITIES	INJURIES	PROPDMG	PROPDMGEXP	CROPDMG	CROPDMGEXP	WFO	STATEOFFIC	ZONENAMES	LATITUDE	LONGITUDE	LATITUDE_E	LONGITUDE_	REMARKS	REFNUM
1	Min. : 1.0	5/25/2011 0:00:00: 1202	12:00:00 AM: 10163	CST :547493	Min. : 0	JEFFERSON : 7840	TX : 83728	HAIL :288661	Min. : 0	N : 86752	COUNTYWIDE : 19680	4/27/2011 0:00:00: 1214	06:00:00 PM: 9802	Min. :0	Mode:logical	Min. : 0	N : 28082	COUNTYWIDE : 19731	Min. : 0.0	Min. : 0	Min. :0	Min. : 0	Min. : 0	Min. : 0.0	Min. : 0	K :424665	Min. : 0.0	K :281832	OUN : 17393	TEXAS, North : 12193	:205988	Min. : 0	Min. :-14451	Min. : 0	Min. :-14455	: 24013	Min. : 1
2	1st Qu.:19.0	4/27/2011 0:00:00: 1193	06:00:00 PM: 7350	EST :245558	1st Qu.: 31	WASHINGTON: 7603	KS : 53440	TSTM WIND :219940	1st Qu.: 0	W : 38446	Countywide : 993	5/25/2011 0:00:00: 1196	05:00:00 PM: 8314	1st Qu.:0	NA’s:902297	1st Qu.: 0	S : 22510	SOUTH PORTION : 833	1st Qu.: 0.0	1st Qu.: 0	1st Qu.:0	1st Qu.: 0	1st Qu.: 0	1st Qu.: 0.0	1st Qu.: 0	M : 11330	1st Qu.: 0.0	M : 1994	JAN : 13889	ARKANSAS, Central and North Central: 11738	GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639	1st Qu.:2802	1st Qu.: 7247	1st Qu.: 0	1st Qu.: 0	Trees down. : 1110	1st Qu.:225575
3	Median :30.0	6/9/2011 0:00:00 : 1030	04:00:00 PM: 7261	MST : 68390	Median : 75	JACKSON : 6660	OK : 46802	THUNDERSTORM WIND: 82563	Median : 0	S : 37558	SPRINGFIELD : 843	6/9/2011 0:00:00 : 1021	04:00:00 PM: 8104	Median :0		Median : 0	W : 20119	NORTH PORTION : 780	Median : 0.0	Median : 0	Median :1	Median : 50	Median : 0	Median : 0.0	Median : 0	0 : 216	Median : 0.0	k : 21	LWX : 13174	IOWA, Central : 11345	GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592	Median :3540	Median : 8707	Median : 0	Median : 0	Several trees were blown down. : 568	Median :451149
4	Mean :31.2	5/30/2004 0:00:00: 1016	05:00:00 PM: 6891	PST : 28302	Mean :101	FRANKLIN : 6256	MO : 35648	TORNADO : 60652	Mean : 1	E : 33178	SOUTH PORTION: 810	4/4/2011 0:00:00 : 1007	12:00:00 PM: 7483	Mean :0		Mean : 1	E : 20047	CENTRAL PORTION: 617	Mean : 0.2	Mean : 8	Mean :1	Mean : 47	Mean : 0	Mean : 0.2	Mean : 12	B : 40	Mean : 1.5	0 : 19	PHI : 12551	KANSAS, Southwest : 11212	JEFFERSON - JEFFERSON : 303	Mean :2875	Mean : 6940	Mean :1452	Mean : 3509	Trees were downed. : 446	Mean :451149
5	3rd Qu.:45.0	4/4/2011 0:00:00 : 1009	12:00:00 PM: 6703	AST : 6360	3rd Qu.:131	LINCOLN : 5937	IA : 31069	FLASH FLOOD : 54277	3rd Qu.: 1	NW : 24041	NORTH PORTION: 784	5/30/2004 0:00:00: 998	11:59:00 PM: 7184	3rd Qu.:0		3rd Qu.: 0	NE : 14606	SPRINGFIELD : 575	3rd Qu.: 0.0	3rd Qu.: 0	3rd Qu.:1	3rd Qu.: 75	3rd Qu.: 0	3rd Qu.: 0.0	3rd Qu.: 0	5 : 28	3rd Qu.: 0.0	B : 9	TSA : 12483	GEORGIA, North and Central : 11120	MADISON - MADISON : 302	3rd Qu.:4019	3rd Qu.: 9605	3rd Qu.:3549	3rd Qu.: 8735	Large trees and power lines were blown down. 432	3rd Qu.:676723
6	Max. :95.0	4/2/2006 0:00:00 : 981	03:00:00 PM: 6700	HST : 2563	Max. :873	(Other) :866412	NE : 30271	FLOOD : 25326	Max. :3749	(Other):134990	(Other) :591444	(Other) :653450	(Other) :622432	Max. :0		Max. :925	(Other): 72096	(Other) :380536	Max. :2315.0	Max. :4400	Max. :5	Max. :22000	Max. :583	Max. :1700.0	Max. :5000	(Other): 84	Max. :990.0	(Other): 9	(Other):690738	(Other) :595920	(Other) :100444	Max. :9706	Max. : 17124	Max. :9706	Max. :106220	(Other) :588295	Max. :902297
7		(Other) :895866	(Other) :857229	(Other): 3631		NA’s : 1589	(Other):621339	(Other) :170878		NA’s :547332	NA’s :287743	NA’s :243411	NA’s :238978				NA’s :724837	NA’s :499225			NA’s :843563					NA’s :465934		NA’s :618413	NA’s :142069	NA’s :248769	NA’s :594029	NA’s :47		NA’s :40		NA’s :287433

Since our focus is in the event type, we checked all available values in EVTYPE and comparing to the event names defined on Page 6 of the National Weather Service Storm Data Documentation. We can see that 985 unique values are pretty messy, and we reproduce below only firsts 25th names in order for you can observe this issue too.

uniqevtype <- unique(raw_data$EVTYPE)
length(uniqevtype) # Total events types unique values
head(sort(uniqevtype), n = 25)

## [1] 985

##  [1]    HIGH SURF ADVISORY        COASTAL FLOOD             
##  [3]  FLASH FLOOD                 LIGHTNING                 
##  [5]  TSTM WIND                   TSTM WIND (G45)           
##  [7]  WATERSPOUT                  WIND                      
##  [9] ?                           ABNORMAL WARMTH            
## [11] ABNORMALLY DRY              ABNORMALLY WET             
## [13] ACCUMULATED SNOWFALL        AGRICULTURAL FREEZE        
## [15] APACHE COUNTY               ASTRONOMICAL HIGH TIDE     
## [17] ASTRONOMICAL LOW TIDE       AVALANCE                   
## [19] AVALANCHE                   BEACH EROSIN               
## [21] Beach Erosion               BEACH EROSION              
## [23] BEACH EROSION/COASTAL FLOOD BEACH FLOOD                
## [25] BELOW NORMAL PRECIPITATION 
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

Since we need consider the magnitude for our cost, we need check how propdmgexp and cropdmgexp are filled.

unique(raw_data$PROPDMGEXP)
unique(raw_data$CROPDMGEXP)

##  [1] K    M    <NA> B    m    +    0    5    6    ?    4    2    3    h   
## [15] 7    H    -    1    8   
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

## [1] <NA> M    K    m    B    ?    0    k    2   
## Levels: ? 0 2 B k K m M

Data Preparation

Here we prossed with all data preparation. Fist we assign the data to a data frame variable named tidy, rename the columns names to lowercase for ease of coding and coerce the bgn_date to date class.

tidy <- raw_data
colnames(tidy) <- tolower(names(tidy))
tidy$bgn_date <- as.Date(tidy$bgn_date, format="%m/%e/%Y %H:%M:%S")

And then, we tidied the events by correcting spelling, remove unuseful characters (special characters, excessive spaces…), simplify names (e.g. remove some adjectives) and change event types to lowercase for easy handle.

# Replace 'and' or '&' with space 
tidy$evtype <- gsub("(\\sAND(\\s|$))|(\\s&\\s)", " ", tidy$evtype) 

# Remove leading and trailing space 
tidy$evtype <- gsub("^\\s+|\\s+$", "", tidy$evtype) 

# Remove multiple spaces 
tidy$evtype <- gsub("\\s{2,}", " ", tidy$evtype) 

# Lowercase Events Types
tidy$evtype <- tolower(tidy[, "evtype"])

# Spelling correction simplify names
tidy$evtype <- gsub("\\s{0,1}-\\s", "/", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\sand[[:alpha:]]{1,}", "/", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\sch([^[[:alpha:]]]{0,}|$)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\schi([^[[:alpha:]]]{0,}|$)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\schil($|\\s)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\ssno($|\\s)", " snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("avalance", "avalanche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("clou$", "cloud", tidy$evtype, ignore.case = TRUE) 
tidy$evtype <- gsub("drie", "dry", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("cstl", "coastal", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("erosin", "erosion", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("fld", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("floodin", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("flooding", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("freeze", "freezing", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("high wind (g40)", "high wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("ligntning", "lightning", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("micoburst", "microburst", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("non(-|\\s)[[:alpha:]]{1,}\\s(.*)", "\\2", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("precipatation", "precipitation", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("rip currents", "rip current", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("strm", "stream", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tornadoes", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("torndao", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tstm", "thunderstorm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tunderstorm", "thunderstorm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("typhoo", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("vog", "fog", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("waterspouts", "waterspout", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wauseon", "", tidy$evtype, ignore.case = TRUE) 
tidy$evtype <- gsub("wayterspout", "waterspout", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wi$|wi[^ndl]|win[^d]", "wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("winder", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wintery", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wintry", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wnd", "wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(hurricane|typhoon).*", "hurricane", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(th?u?n?d?ee?r?|tstm).*", "thunderstorm wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(tornado|torndao).*", "tornado", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("avalance.*", "avalanche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("blizzard.*", "blizzard", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("coastal.*(flood|surge).*", "coastal flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("cold.*", "cold/wind chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("dust.*devel.*", "dust devil", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("dust.*storm.*", "dust storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("excessive.*heat.*", "excessive heat", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("extreme.*(code|wind).*", "extreme code/wind chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("flash.*flood.*|.*flood.*flash.*", "flash flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("funnel.*(cloud)*.*", "funnel cloud", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("hail.*", "hail", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heat.*", "heat", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heavy.*(rain|shower).*", "heavy rain", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heavy.*snow.*", "heavy snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("high.*surf.*", "high surf", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("high.*wind.*", "high wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("ice.*storm.*", "ice storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("lake.*effect.*snow.*", "lake-effect snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("lig(h|n)tn?ing.*", "lightning", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("rip.*current.*", "rip current", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("seiche.*", "seiche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("storm.*", "storm surge/tide", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("tropical.*storm.*", "tropical storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("volcanic.*", "volcanic ash", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("wild.*fire.*", "wildfire", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("summary.*", "summary", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*flood.*|.*floooding.*)", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(sleet.*|.*sleet.*|.*snow.*rain.*|.*rain.*snow.*)", "sleet", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*free.*|.*frost.*)", "frost/freezing", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(icy roads|.*snow.*|.*ice.*)", "ice/snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*thunderstorm.*", "thunderstorm wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*blizzard.*", "blizzard", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(hurricane|gustnado)", "hurricane", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*seas.*|high tides|blow-out tides|blow-out tide)", "heavy seas/tides ", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*erosion.*", "costal erosion", tidy$evtype, ignore.case = TRUE) 
tidy$evtype <- sub(".*whirlwind*", "whirlwind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*landspout*", "landspout", tidy$evtype, ignore.case = TRUE)

Since we continue with a huge numbers of events types, probable we have better results if we group them. For this, we can see that National Climatic Data Center Storm Events FAQ mentioned that annual summary reports (see a eg at http://www.ncdc.noaa.gov/oa/climate/sd/annsum2009.pdf), This report groups the natural disaster events into seven category events in the following way:
- convection: lightning, tornado, thunderstorm wind and hail;
- extreme temperature: cold and heat; - flood: flash flood and river flood;
- marine: coastal storm, tsunami and rip current;
- tropical cyclones: tropical storm/hurricane;
- winter: winter storm, ice and avalance;
- other: drought, dust storm, dust devil, rain, fog, high wind, waterspout, fire weather mud slide, volcanic ash and miscellaneous.

We note that some events types are not mentioned and appear pertain to one of these first six categories. Since we note that we have much quality problems with the input, we made a second input review and reset these categories for it below:
- convection: lightning, tornado, thunderstorm, whirlwind, landspout and hail;
- extreme temperature: cold and heat;
- flood: flash flood and river flood;
- marine: coastal storm, tsunami and rip current;
- tropical cyclones: tropical storm/hurricane;
- winter: winter storm, ice, snow, sleet, frost/freezing, blizzard and avalance;
- other: all remaning events types.

Finaly, we include this group categories of events into our tidy dataset.

# create a new column for group events
tidy$evgroup <- NA

# 1) select convection group events and assign it to evgroup column.
tidy[grepl("lightning", tidy$evtype) | grepl("tornado", tidy$evtype) | grepl("thunderstorm", tidy$evtype) | grepl("whirlwind", tidy$evtype) | grepl("landspout", tidy$evtype) | grepl("hail", tidy$evtype), c("evgroup")] <- "convection"

# 2) select ext. temp group events and assign it to evgroup column.
tidy[grepl("cold", tidy$evtype) | grepl("heat", tidy$evtype), c("evgroup")] <- "ext. temp"

# 3) select flood group events and assign it to evgroup column.
tidy[grepl("flash flood", tidy$evtype) | grepl("flood", tidy$evtype) |grepl("river flood", tidy$evtype), c("evgroup")] <- "flood"

# 4) select marine group events and assign it to evgroup column.
tidy[grepl("coastal storm", tidy$evtype) | grepl("tsunami", tidy$evtype) | grepl("rip current", tidy$evtype), c("evgroup")] <- "marine"

# 5) select tropical cyclones group event and assign it to evgroup column.
tidy[grepl("tropical storm", tidy$evtype) | grepl("hurricane", tidy$evtype), c("evgroup")] <- "trop. cyclones"

# 6) select winter group events and assign it to evgroup column.
tidy[grepl("winter storm", tidy$evtype) | grepl("ice", tidy$evtype) | grepl("sleet", tidy$evtype) | grepl("frost/freezing", tidy$evtype) | grepl("ice/snow", tidy$evtype) | grepl("blizzard", tidy$evtype) | grepl("avalanche", tidy$evtype), c("evgroup")] <- "winter"

# 7) select other group events and assign it to evgroup column.
tidy[is.na(tidy$evgroup), "evgroup"] <- "other"

tidy <- tidy[tidy$bgn_date > as.Date("01/01/1996 0:00:00", "%m/%e/%Y %H:%M:%S"), ]

In a reviewed, we compare the annual summaries report with our data, and find at 2006 that total cost of Flash Flood and River Flood is 2,136.6 million and 1,726.3 million, but we find an entry in billions into our data so, we must replace the “b” by “m” on this record. See below this issue and the correction code:

# Show the events which input the property damage cost wrong in billion instead in million
tidy[grepl("115", tidy$propdmg) & grepl("32.5", tidy$cropdmg), c(2,8,25,26,27,28)]

# replace 'b' by 'm'
tidy[grepl("115", tidy$propdmg) & grepl("32.5", tidy$cropdmg) & 
             grepl("B", tidy$propdmgexp), c("propdmgexp")] <- "M"

##          bgn_date evtype propdmg propdmgexp cropdmg cropdmgexp
## 567251 2005-12-31  flood     115          M    32.5          M
## 605953 2006-01-01  flood     115          B    32.5          M

In the examination the metadata documentation (see: http://thmp.info/metadata/other-storms/Other%20Storm%20Events%201955-2003.html), we see that Prop. Damage Exp.: expressed in dollar unit type, so considered that k is 1,000, m is 1,000,000, b is 1,000,000,000 and others values is 1**.

We create the function below to apply this treatment at propdmgexp and cropdmgexp.

get_exponent <- function(PROPdmgEXP) {
        if (is.na(PROPdmgEXP))  exponent <- 1 
        else {
                exp <- toupper(PROPdmgEXP);
                if (exp == "K") { exponent <- 1000 }
                else if (exp == "M") { exponent <- 1000000 }
                else if (exp == "B") { exponent <- 1000000000 }
                else { exponent <- 1 }
        }
        exponent
}

So the total damage cost is the sum of the property damage cost, resulting from propdmg times propdmgexp, and crop damage cost, resulting from cropdmg times cropdmgexp.

tidy$damage <- with(tidy, propdmg * sapply(propdmgexp, get_exponent)) +
        with(tidy, cropdmg * sapply(cropdmgexp, get_exponent))

Next, we remove unuseful events from the data set to reduce the date size for easier handel, the reason for this is if an event had no fatalities and no injured and no crop or property damage, than for the purpose of this analysis it is useless.

# subset observations which have fatalities or injured or Damages
tidy <- subset(tidy, injuries > 0 | fatalities > 0 | damage > 0)

Also, we compute the total casualties and save it in a new variable called casualties.

tidy$casualties <- tidy$fatalities + tidy$injuries

Finally, we subset the columns which are useful for this analysis.

tidy <- tidy[tidy$evtype!="summary", c("bgn_date", "evgroup", "evtype", "fatalities", "injuries", "casualties", "damage")]

Results

Summary Count Events Group

We initiate with check the summary count of event groups table to see which event group have most observations.

xt <- table(tidy$evgroup)
xt

## 
##     convection      ext. temp          flood         marine          other 
##         153811            965          29519            621          13036 
## trop. cyclones         winter 
##            609           2752

From the table above, we can see the convection event category, which includes lightning, tornado, thunderstorm wind, hail, occurs 153,811, it is most frequncy compare to other event group and it is 5.21 times of second category flood.

Check worse Weather Event Group Types by Casualties and Damage Cost

We proced with focus analysis in the weather category events to identify which are the worse ones from a perspective of Casualties and Damage Cost. For that, we proced with an aggregation of the data by evgroup.

For plotting needs, we create two new datasets, both ordinate decrescent and also reset the factor of evgroup according these order.

### Aggregating data by weather category events
aggr_grp <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ evgroup, data = tidy, FUN=sum, na.rm=TRUE)

worse_casualties <- aggr_grp[with(aggr_grp, order(-casualties)),]
worse_casualties$evgroup <-factor(worse_casualties$evgroup, levels=worse_casualties[order(worse_casualties$casualties), "evgroup"])

worse_damage <- aggr_grp[with(aggr_grp, order(-damage)),];
worse_damage$evgroup <-factor(worse_damage$evgroup, levels=worse_damage[order(worse_damage$damage), "evgroup"]);

aggr <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ bgn_date + evgroup, data = tidy[tidy$bgn_date > as.Date("01/11/2006 0:00:00", "%m/%e/%Y %H:%M:%S"), ], FUN=sum, na.rm=TRUE);

aggr2 <- aggregate(damage ~ evgroup, data = aggr, FUN=sum, na.rm=TRUE);

Figure 1: Weather Category Events Across US:

pd1 <-  ggplot(worse_damage, aes(x=evgroup, y=damage)) +
        ggtitle("Economic Impact (1996 to 2011)") +
        xlab("") + ylab("damage Cost") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        geom_bar(stat="identity") +  coord_flip()

pc1 <-  ggplot(worse_casualties, aes(x=evgroup, y=casualties)) +
        xlab("Weather Category Events") +  ylab("Casualties (fatalities + injuries)") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        ggtitle("Causing Most Casualties (1996 to 2011)") +
        geom_bar(stat="identity") + coord_flip()


pt1 <-  qplot(bgn_date, damage, data=aggr, geom="line", group=evgroup, colour=evgroup, position="identity") +
        ggtitle("Weather Event Group Damage Cost (11/06 to 11/2011) Across US") +
        theme(text = element_text(size=8)) 
              

par(mfrow=c(2,2),mar=c(5.1,4.1,4.1,2.1), oma=c(0,0,0,0))

vp.setup(2,2)

print(pt1, vp=vp.layout(1, 1:2))
print(pd1, vp=vp.layout(2,1))
print(pc1, vp=vp.layout(2,2))

plot of chunk plot_worse_events_group

Figure 1: Weather Category Events Across US

From this graphic above we can see that convection represents the worst category from casualties perspective is convection with 35,311. It is 3.58 times of second category extreme temp, but if you consider only the fatalities see that difference decrease to 1.31 times of extreme temp.

From the total damage cost perspective, we can see that tropical cyclones events have the greatest economic consequences with 94.8 billion dollars. This largely due to the catastrophic events of the first half of 2006 as the New Orleans caused by Katrina. Note that this events are really a record. If we only consider the last 5 year, see the first line graph above, we note that convection and flood take first place with 29.5 and 26.3 billion dollars respectively.

Check worse Weather Event Types by Damage Cost, Casualties and Fatalities

Well, the other category appears like a big bag of cats, since it appears with extreme importance to damage cost, it is interesting make other analysis to check the worst weather events from its types.

So, we proced with focus analysis in the weather events types to identify which are the 10th worse ones from a perspective of Damage Cost, Casualties and fatalities. For that, we proced with an aggregation of the data by evtype.
For plotting needs, we create two new datasets, both ordinate decrescent and also reset the factor of evtype according these order.

### Aggregating data by event types
aggr_type <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ evtype, data = tidy, FUN=sum, na.rm=TRUE)

### Getting the top ten Weather Event Types of Casualties and Damage Cost
top_damage <- aggr_type[with(aggr_type, order(-damage)),][1:10,]
top_damage$evtype <-factor(top_damage$evtype, levels=top_damage[order(top_damage$damage), "evtype"])

top_casualties <- aggr_type[with(aggr_type, order(-casualties)),][1:10,]
top_casualties$evtype <-factor(top_casualties$evtype, levels=top_casualties[order(top_casualties$casualties), "evtype"])

top_fatalities <- aggr_type[with(aggr_type, order(-fatalities)),][1:10,]
top_fatalities$evtype <-factor(top_fatalities$evtype, levels=top_fatalities[order(top_fatalities$fatalities), "evtype"])

Figure 2: The 10th Worst Weather Event Types Across US:

pd2 <- ggplot(top_damage, aes(x=evtype, y=damage)) +
        ggtitle("Having Greatest Economic Impact") +
        xlab("Weather Event Type") +  ylab("damage Cost") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        geom_bar(stat="identity") +  coord_flip()

pc2 <- ggplot(top_casualties, aes(x=evtype, y=casualties)) +
        xlab("") + ylab("Casualties (fatalities + injuries)") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        ggtitle("Causing Most Casualties") +
        geom_bar(stat="identity") + coord_flip()

pf2 <- ggplot(top_fatalities, aes(x=evtype, y=fatalities)) +
        ggtitle("Causing Most Fatalities") +
        xlab("") + ylab("Fatalities") +
        theme(text = element_text(size=8), 
              axis.text.y = element_text(size= 6)) +
        geom_bar(stat="identity") +  coord_flip()

par(mar=c(5.1,4.1,4.1,2.1), oma=c(0,0,0,0))

grid.arrange(pd2, pc2, pf2, nrow = 1, ncol=3)

Figure 2: The 10th Worst Weather Event Types Across US

From this graphic above we can see that tornado represents the worst eventtype from casualties perspective is tornado with 22,183. It is 2.25 times of second event type flood. On the other hand, if you consider only the fatalities, the tornados surge as the second and excessive heat takes the worst place with 18.9% more deaths than tornados.

From the total damage cost perspective, we can see that hurricane events have the greatest economic consequences with 86.5 billion dollars.

Conclusion:

From casualties perspective, we can see that:

convention event category (which includes lightning, tornado, thunderstorm wind, hail) is most frequncy and is the worst category from casualties perspective with 35,311. Yet, if we consider only the fatalities, the diference decrease to 1.31 times of extreme temp.
tornado represents the worst event type with 2.25 times of second event type flood. On the other hand, if you consider only the fatalities, the excessive heat takes the worst place with 18.9% more deaths than tornados.

From the total damage cost perspective, we can see that:

tropical cyclones events category have the greatest economic consequences with 1.19 times other, the second category.
hurricane events types have the greatest economic consequences with 1.69 times of second enevt type flood.

So, tornado is the most health harmful, but excessive heat cause the greatest mortalities and hurricane have the greatest economic consequences

The Worst Weather Events across US to the most health harmful is tornado, but excessive heat cause the greatest mortalities and hurricane have the greatest economic consequences considering the record in 2006.

### Publish references:### Reproducible Research: Peer Assessment 2 from Coursera by Johns Hopkins University
Online publish in: http://rpubs.com/mgmarques/NOAA_Analysis
github repo with RMarkdown source code: https://github.com/mgmarques/NOAA_Analysis