This report presents the worst weather events across the United States. It is considering their effects in harmful with respect to population health and what this have the greatest economic consequences.
To achieve that, we obtain the storm data from U.S. National Oceanic and Atmospheric Administration (NOAA). We process these data by tidying, in special the event types that have too much data quality issues. We considering a subset of the data since 1996, when we have all the event types recorded in the database, sum the fatalities with injuries to obtain the total casualties and calculate the damage cost consider the magnitude and property damage cost and crop damage cost.
Finally, our results indicate that convection events category are the most harmful with respect to population health, in special tornado, that belong this category, is the worst event type. On the other hand, we need pay more attention with first place of the mortalities caused by excessive heat event type. Also, the tropical cyclones events category have the greatest economic consequences considering the record in 2006, in special the hurricane event types that belong this category, is the greatest ones followed by flood.
Throughout this report you can always find the code that I used to generate my output presents here. When writing code chunks in the R markdown document, always use echo = TRUE so that someone else will be able to read the code. This assignment will be evaluated via peer assessment so it’s essential that my peer evaluators be able to review my code and my analysis together..
First, we set echo equal a TRUE and results equal a ‘hold’ as global options for this document.
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.1
opts_chunk$set(echo = TRUE, results = 'hold')
Next, we load the libraries used in our analysis: ### Load data.table, xtable and ggplot2 libraries
library(data.table)
library(xtable)
library(ggplot2) # we shall use ggplot2 for plotting figures
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.1
## Loading required package: grid
We don’t use any random numbers or functions that use it, so we don’t need set a seed.
Next, we show our system environment, maybe it is useful for reproducible proposals:
sessionInfo()
## R version 3.1.0 Patched (2014-04-10 r65396)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
## [5] LC_TIME=Portuguese_Brazil.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridExtra_0.9.1 ggplot2_1.0.0 xtable_1.7-3 data.table_1.9.2
## [5] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_0.10
## [5] gtable_0.1.2 htmltools_0.2.4 MASS_7.3-31 munsell_0.4.2
## [9] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
## [13] rmarkdown_0.2.46 scales_0.2.4 stringr_0.6.2 tools_3.1.0
## [17] yaml_2.1.13
This assignment makes use of Storm Data in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation.
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
This assignment instructions request to show any code that is needed to loading and preprocessing the data, like to:
First we define a function to check if the files exists in the path defined by file_path. If don’t exists stops execution of the current expression, and executes an error action.
check_file_exist <- function(file_path)
{
if (!file.exists(file_path))
stop("The ", file_path, " not found!") else TRUE
}
Next, we use data set and data_dir to define the file_path, call the check_file_exist function to check, if the file exist send a message to user for waiting the load of file. Finally returned with data set load to data variable.
load_data <- function(data_dir , fileURL, fileSource)
{
# Dataset check and load
file_path <- paste(data_dir, "/", fileSource , sep="")
if (!file.exists(file_path)) {
message(paste("Please Wait! Download...", fileURL, "..."));
download.file(fileURL, destfile=source_path);
}
data <- read.csv(bzfile(file_path),
header=TRUE, na.strings=c("NA",""))
}
Maybe you need to change this data_dir variable to yours peeress directory, because the line code that ask you inform where the data directory is find, use readline function, not function at markdown documents.
data_dir <- "C:/Users/Marcelo/NOAA_Analysis/Data";
Here, we initiate the main variables and run data load and preparation process for the activate data.
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fileSource <-"repdata-data-StormData.csv.bz2"
raw_data <- load_data(data_dir , fileURL, fileSource)
At last, we define some functions to simplify the actions for multiple plots in one graphic.
# define function to create multi-plot setup (nrow, ncol)
vp.setup <- function(x,y){
# create a new layout with grid
grid.newpage()
# define viewports and assign it to grid layout
pushViewport(viewport(layout = grid.layout(x,y)))
}
# define function to easily access layout (row, col)
vp.layout <- function(x,y){
viewport(layout.pos.row=x, layout.pos.col=y)
}
Now, we can proceed with the data pre-examination, first we examine the str:
str(raw_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Then we chech its summary.
xt <- xtable(summary(raw_data))
print(xt, type = "html")
| STATE__ | BGN_DATE | BGN_TIME | TIME_ZONE | COUNTY | COUNTYNAME | STATE | EVTYPE | BGN_RANGE | BGN_AZI | BGN_LOCATI | END_DATE | END_TIME | COUNTY_END | COUNTYENDN | END_RANGE | END_AZI | END_LOCATI | LENGTH | WIDTH | F | MAG | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP | WFO | STATEOFFIC | ZONENAMES | LATITUDE | LONGITUDE | LATITUDE_E | LONGITUDE_ | REMARKS | REFNUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Min. : 1.0 | 5/25/2011 0:00:00: 1202 | 12:00:00 AM: 10163 | CST :547493 | Min. : 0 | JEFFERSON : 7840 | TX : 83728 | HAIL :288661 | Min. : 0 | N : 86752 | COUNTYWIDE : 19680 | 4/27/2011 0:00:00: 1214 | 06:00:00 PM: 9802 | Min. :0 | Mode:logical | Min. : 0 | N : 28082 | COUNTYWIDE : 19731 | Min. : 0.0 | Min. : 0 | Min. :0 | Min. : 0 | Min. : 0 | Min. : 0.0 | Min. : 0 | K :424665 | Min. : 0.0 | K :281832 | OUN : 17393 | TEXAS, North : 12193 | :205988 | Min. : 0 | Min. :-14451 | Min. : 0 | Min. :-14455 | : 24013 | Min. : 1 |
| 2 | 1st Qu.:19.0 | 4/27/2011 0:00:00: 1193 | 06:00:00 PM: 7350 | EST :245558 | 1st Qu.: 31 | WASHINGTON: 7603 | KS : 53440 | TSTM WIND :219940 | 1st Qu.: 0 | W : 38446 | Countywide : 993 | 5/25/2011 0:00:00: 1196 | 05:00:00 PM: 8314 | 1st Qu.:0 | NA’s:902297 | 1st Qu.: 0 | S : 22510 | SOUTH PORTION : 833 | 1st Qu.: 0.0 | 1st Qu.: 0 | 1st Qu.:0 | 1st Qu.: 0 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.: 0 | M : 11330 | 1st Qu.: 0.0 | M : 1994 | JAN : 13889 | ARKANSAS, Central and North Central: 11738 | GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639 | 1st Qu.:2802 | 1st Qu.: 7247 | 1st Qu.: 0 | 1st Qu.: 0 | Trees down. : 1110 | 1st Qu.:225575 |
| 3 | Median :30.0 | 6/9/2011 0:00:00 : 1030 | 04:00:00 PM: 7261 | MST : 68390 | Median : 75 | JACKSON : 6660 | OK : 46802 | THUNDERSTORM WIND: 82563 | Median : 0 | S : 37558 | SPRINGFIELD : 843 | 6/9/2011 0:00:00 : 1021 | 04:00:00 PM: 8104 | Median :0 | Median : 0 | W : 20119 | NORTH PORTION : 780 | Median : 0.0 | Median : 0 | Median :1 | Median : 50 | Median : 0 | Median : 0.0 | Median : 0 | 0 : 216 | Median : 0.0 | k : 21 | LWX : 13174 | IOWA, Central : 11345 | GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592 | Median :3540 | Median : 8707 | Median : 0 | Median : 0 | Several trees were blown down. : 568 | Median :451149 | |
| 4 | Mean :31.2 | 5/30/2004 0:00:00: 1016 | 05:00:00 PM: 6891 | PST : 28302 | Mean :101 | FRANKLIN : 6256 | MO : 35648 | TORNADO : 60652 | Mean : 1 | E : 33178 | SOUTH PORTION: 810 | 4/4/2011 0:00:00 : 1007 | 12:00:00 PM: 7483 | Mean :0 | Mean : 1 | E : 20047 | CENTRAL PORTION: 617 | Mean : 0.2 | Mean : 8 | Mean :1 | Mean : 47 | Mean : 0 | Mean : 0.2 | Mean : 12 | B : 40 | Mean : 1.5 | 0 : 19 | PHI : 12551 | KANSAS, Southwest : 11212 | JEFFERSON - JEFFERSON : 303 | Mean :2875 | Mean : 6940 | Mean :1452 | Mean : 3509 | Trees were downed. : 446 | Mean :451149 | |
| 5 | 3rd Qu.:45.0 | 4/4/2011 0:00:00 : 1009 | 12:00:00 PM: 6703 | AST : 6360 | 3rd Qu.:131 | LINCOLN : 5937 | IA : 31069 | FLASH FLOOD : 54277 | 3rd Qu.: 1 | NW : 24041 | NORTH PORTION: 784 | 5/30/2004 0:00:00: 998 | 11:59:00 PM: 7184 | 3rd Qu.:0 | 3rd Qu.: 0 | NE : 14606 | SPRINGFIELD : 575 | 3rd Qu.: 0.0 | 3rd Qu.: 0 | 3rd Qu.:1 | 3rd Qu.: 75 | 3rd Qu.: 0 | 3rd Qu.: 0.0 | 3rd Qu.: 0 | 5 : 28 | 3rd Qu.: 0.0 | B : 9 | TSA : 12483 | GEORGIA, North and Central : 11120 | MADISON - MADISON : 302 | 3rd Qu.:4019 | 3rd Qu.: 9605 | 3rd Qu.:3549 | 3rd Qu.: 8735 |
| 3rd Qu.:676723 | |
| 6 | Max. :95.0 | 4/2/2006 0:00:00 : 981 | 03:00:00 PM: 6700 | HST : 2563 | Max. :873 | (Other) :866412 | NE : 30271 | FLOOD : 25326 | Max. :3749 | (Other):134990 | (Other) :591444 | (Other) :653450 | (Other) :622432 | Max. :0 | Max. :925 | (Other): 72096 | (Other) :380536 | Max. :2315.0 | Max. :4400 | Max. :5 | Max. :22000 | Max. :583 | Max. :1700.0 | Max. :5000 | (Other): 84 | Max. :990.0 | (Other): 9 | (Other):690738 | (Other) :595920 | (Other) :100444 | Max. :9706 | Max. : 17124 | Max. :9706 | Max. :106220 | (Other) :588295 | Max. :902297 | |
| 7 | (Other) :895866 | (Other) :857229 | (Other): 3631 | NA’s : 1589 | (Other):621339 | (Other) :170878 | NA’s :547332 | NA’s :287743 | NA’s :243411 | NA’s :238978 | NA’s :724837 | NA’s :499225 | NA’s :843563 | NA’s :465934 | NA’s :618413 | NA’s :142069 | NA’s :248769 | NA’s :594029 | NA’s :47 | NA’s :40 | NA’s :287433 |
Since our focus is in the event type, we checked all available values in EVTYPE and comparing to the event names defined on Page 6 of the National Weather Service Storm Data Documentation. We can see that 985 unique values are pretty messy, and we reproduce below only firsts 25th names in order for you can observe this issue too.
uniqevtype <- unique(raw_data$EVTYPE)
length(uniqevtype) # Total events types unique values
head(sort(uniqevtype), n = 25)
## [1] 985
## [1] HIGH SURF ADVISORY COASTAL FLOOD
## [3] FLASH FLOOD LIGHTNING
## [5] TSTM WIND TSTM WIND (G45)
## [7] WATERSPOUT WIND
## [9] ? ABNORMAL WARMTH
## [11] ABNORMALLY DRY ABNORMALLY WET
## [13] ACCUMULATED SNOWFALL AGRICULTURAL FREEZE
## [15] APACHE COUNTY ASTRONOMICAL HIGH TIDE
## [17] ASTRONOMICAL LOW TIDE AVALANCE
## [19] AVALANCHE BEACH EROSIN
## [21] Beach Erosion BEACH EROSION
## [23] BEACH EROSION/COASTAL FLOOD BEACH FLOOD
## [25] BELOW NORMAL PRECIPITATION
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
Since we need consider the magnitude for our cost, we need check how propdmgexp and cropdmgexp are filled.
unique(raw_data$PROPDMGEXP)
unique(raw_data$CROPDMGEXP)
## [1] K M <NA> B m + 0 5 6 ? 4 2 3 h
## [15] 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## [1] <NA> M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
Here we prossed with all data preparation. Fist we assign the data to a data frame variable named tidy, rename the columns names to lowercase for ease of coding and coerce the bgn_date to date class.
tidy <- raw_data
colnames(tidy) <- tolower(names(tidy))
tidy$bgn_date <- as.Date(tidy$bgn_date, format="%m/%e/%Y %H:%M:%S")
And then, we tidied the events by correcting spelling, remove unuseful characters (special characters, excessive spaces…), simplify names (e.g. remove some adjectives) and change event types to lowercase for easy handle.
# Replace 'and' or '&' with space
tidy$evtype <- gsub("(\\sAND(\\s|$))|(\\s&\\s)", " ", tidy$evtype)
# Remove leading and trailing space
tidy$evtype <- gsub("^\\s+|\\s+$", "", tidy$evtype)
# Remove multiple spaces
tidy$evtype <- gsub("\\s{2,}", " ", tidy$evtype)
# Lowercase Events Types
tidy$evtype <- tolower(tidy[, "evtype"])
# Spelling correction simplify names
tidy$evtype <- gsub("\\s{0,1}-\\s", "/", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\sand[[:alpha:]]{1,}", "/", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\sch([^[[:alpha:]]]{0,}|$)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\schi([^[[:alpha:]]]{0,}|$)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\schil($|\\s)", " chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("\\ssno($|\\s)", " snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("avalance", "avalanche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("clou$", "cloud", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("drie", "dry", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("cstl", "coastal", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("erosin", "erosion", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("fld", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("floodin", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("flooding", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("freeze", "freezing", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("high wind (g40)", "high wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("ligntning", "lightning", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("micoburst", "microburst", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("non(-|\\s)[[:alpha:]]{1,}\\s(.*)", "\\2", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("precipatation", "precipitation", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("rip currents", "rip current", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("strm", "stream", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tornadoes", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("torndao", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tstm", "thunderstorm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("tunderstorm", "thunderstorm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("typhoo", "tornados", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("vog", "fog", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("waterspouts", "waterspout", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wauseon", "", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wayterspout", "waterspout", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wi$|wi[^ndl]|win[^d]", "wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("winder", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wintery", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wintry", "winter", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- gsub("wnd", "wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(hurricane|typhoon).*", "hurricane", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(th?u?n?d?ee?r?|tstm).*", "thunderstorm wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(tornado|torndao).*", "tornado", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("avalance.*", "avalanche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("blizzard.*", "blizzard", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("coastal.*(flood|surge).*", "coastal flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("cold.*", "cold/wind chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("dust.*devel.*", "dust devil", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("dust.*storm.*", "dust storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("excessive.*heat.*", "excessive heat", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("extreme.*(code|wind).*", "extreme code/wind chill", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("flash.*flood.*|.*flood.*flash.*", "flash flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("funnel.*(cloud)*.*", "funnel cloud", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("hail.*", "hail", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heat.*", "heat", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heavy.*(rain|shower).*", "heavy rain", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("heavy.*snow.*", "heavy snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("high.*surf.*", "high surf", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("high.*wind.*", "high wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("ice.*storm.*", "ice storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("lake.*effect.*snow.*", "lake-effect snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("lig(h|n)tn?ing.*", "lightning", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("rip.*current.*", "rip current", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("seiche.*", "seiche", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("storm.*", "storm surge/tide", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("tropical.*storm.*", "tropical storm", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("volcanic.*", "volcanic ash", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("wild.*fire.*", "wildfire", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("summary.*", "summary", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*flood.*|.*floooding.*)", "flood", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(sleet.*|.*sleet.*|.*snow.*rain.*|.*rain.*snow.*)", "sleet", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*free.*|.*frost.*)", "frost/freezing", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(icy roads|.*snow.*|.*ice.*)", "ice/snow", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*thunderstorm.*", "thunderstorm wind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*blizzard.*", "blizzard", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(hurricane|gustnado)", "hurricane", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub("(.*seas.*|high tides|blow-out tides|blow-out tide)", "heavy seas/tides ", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*erosion.*", "costal erosion", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*whirlwind*", "whirlwind", tidy$evtype, ignore.case = TRUE)
tidy$evtype <- sub(".*landspout*", "landspout", tidy$evtype, ignore.case = TRUE)
Since we continue with a huge numbers of events types, probable we have better results if we group them. For this, we can see that National Climatic Data Center Storm Events FAQ mentioned that annual summary reports (see a eg at http://www.ncdc.noaa.gov/oa/climate/sd/annsum2009.pdf), This report groups the natural disaster events into seven category events in the following way:
- convection: lightning, tornado, thunderstorm wind and hail;
- extreme temperature: cold and heat; - flood: flash flood and river flood;
- marine: coastal storm, tsunami and rip current;
- tropical cyclones: tropical storm/hurricane;
- winter: winter storm, ice and avalance;
- other: drought, dust storm, dust devil, rain, fog, high wind, waterspout, fire weather mud slide, volcanic ash and miscellaneous.
We note that some events types are not mentioned and appear pertain to one of these first six categories. Since we note that we have much quality problems with the input, we made a second input review and reset these categories for it below:
- convection: lightning, tornado, thunderstorm, whirlwind, landspout and hail;
- extreme temperature: cold and heat;
- flood: flash flood and river flood;
- marine: coastal storm, tsunami and rip current;
- tropical cyclones: tropical storm/hurricane;
- winter: winter storm, ice, snow, sleet, frost/freezing, blizzard and avalance;
- other: all remaning events types.
Finaly, we include this group categories of events into our tidy dataset.
# create a new column for group events
tidy$evgroup <- NA
# 1) select convection group events and assign it to evgroup column.
tidy[grepl("lightning", tidy$evtype) | grepl("tornado", tidy$evtype) | grepl("thunderstorm", tidy$evtype) | grepl("whirlwind", tidy$evtype) | grepl("landspout", tidy$evtype) | grepl("hail", tidy$evtype), c("evgroup")] <- "convection"
# 2) select ext. temp group events and assign it to evgroup column.
tidy[grepl("cold", tidy$evtype) | grepl("heat", tidy$evtype), c("evgroup")] <- "ext. temp"
# 3) select flood group events and assign it to evgroup column.
tidy[grepl("flash flood", tidy$evtype) | grepl("flood", tidy$evtype) |grepl("river flood", tidy$evtype), c("evgroup")] <- "flood"
# 4) select marine group events and assign it to evgroup column.
tidy[grepl("coastal storm", tidy$evtype) | grepl("tsunami", tidy$evtype) | grepl("rip current", tidy$evtype), c("evgroup")] <- "marine"
# 5) select tropical cyclones group event and assign it to evgroup column.
tidy[grepl("tropical storm", tidy$evtype) | grepl("hurricane", tidy$evtype), c("evgroup")] <- "trop. cyclones"
# 6) select winter group events and assign it to evgroup column.
tidy[grepl("winter storm", tidy$evtype) | grepl("ice", tidy$evtype) | grepl("sleet", tidy$evtype) | grepl("frost/freezing", tidy$evtype) | grepl("ice/snow", tidy$evtype) | grepl("blizzard", tidy$evtype) | grepl("avalanche", tidy$evtype), c("evgroup")] <- "winter"
# 7) select other group events and assign it to evgroup column.
tidy[is.na(tidy$evgroup), "evgroup"] <- "other"
tidy <- tidy[tidy$bgn_date > as.Date("01/01/1996 0:00:00", "%m/%e/%Y %H:%M:%S"), ]
In a reviewed, we compare the annual summaries report with our data, and find at 2006 that total cost of Flash Flood and River Flood is 2,136.6 million and 1,726.3 million, but we find an entry in billions into our data so, we must replace the “b” by “m” on this record. See below this issue and the correction code:
# Show the events which input the property damage cost wrong in billion instead in million
tidy[grepl("115", tidy$propdmg) & grepl("32.5", tidy$cropdmg), c(2,8,25,26,27,28)]
# replace 'b' by 'm'
tidy[grepl("115", tidy$propdmg) & grepl("32.5", tidy$cropdmg) &
grepl("B", tidy$propdmgexp), c("propdmgexp")] <- "M"
## bgn_date evtype propdmg propdmgexp cropdmg cropdmgexp
## 567251 2005-12-31 flood 115 M 32.5 M
## 605953 2006-01-01 flood 115 B 32.5 M
In the examination the metadata documentation (see: http://thmp.info/metadata/other-storms/Other%20Storm%20Events%201955-2003.html), we see that Prop. Damage Exp.: expressed in dollar unit type, so considered that k is 1,000, m is 1,000,000, b is 1,000,000,000 and others values is 1**.
We create the function below to apply this treatment at propdmgexp and cropdmgexp.
get_exponent <- function(PROPdmgEXP) {
if (is.na(PROPdmgEXP)) exponent <- 1
else {
exp <- toupper(PROPdmgEXP);
if (exp == "K") { exponent <- 1000 }
else if (exp == "M") { exponent <- 1000000 }
else if (exp == "B") { exponent <- 1000000000 }
else { exponent <- 1 }
}
exponent
}
So the total damage cost is the sum of the property damage cost, resulting from propdmg times propdmgexp, and crop damage cost, resulting from cropdmg times cropdmgexp.
tidy$damage <- with(tidy, propdmg * sapply(propdmgexp, get_exponent)) +
with(tidy, cropdmg * sapply(cropdmgexp, get_exponent))
Next, we remove unuseful events from the data set to reduce the date size for easier handel, the reason for this is if an event had no fatalities and no injured and no crop or property damage, than for the purpose of this analysis it is useless.
# subset observations which have fatalities or injured or Damages
tidy <- subset(tidy, injuries > 0 | fatalities > 0 | damage > 0)
Also, we compute the total casualties and save it in a new variable called casualties.
tidy$casualties <- tidy$fatalities + tidy$injuries
Finally, we subset the columns which are useful for this analysis.
tidy <- tidy[tidy$evtype!="summary", c("bgn_date", "evgroup", "evtype", "fatalities", "injuries", "casualties", "damage")]
We initiate with check the summary count of event groups table to see which event group have most observations.
xt <- table(tidy$evgroup)
xt
##
## convection ext. temp flood marine other
## 153811 965 29519 621 13036
## trop. cyclones winter
## 609 2752
From the table above, we can see the convection event category, which includes lightning, tornado, thunderstorm wind, hail, occurs 153,811, it is most frequncy compare to other event group and it is 5.21 times of second category flood.
We proced with focus analysis in the weather category events to identify which are the worse ones from a perspective of Casualties and Damage Cost. For that, we proced with an aggregation of the data by evgroup.
For plotting needs, we create two new datasets, both ordinate decrescent and also reset the factor of evgroup according these order.
### Aggregating data by weather category events
aggr_grp <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ evgroup, data = tidy, FUN=sum, na.rm=TRUE)
worse_casualties <- aggr_grp[with(aggr_grp, order(-casualties)),]
worse_casualties$evgroup <-factor(worse_casualties$evgroup, levels=worse_casualties[order(worse_casualties$casualties), "evgroup"])
worse_damage <- aggr_grp[with(aggr_grp, order(-damage)),];
worse_damage$evgroup <-factor(worse_damage$evgroup, levels=worse_damage[order(worse_damage$damage), "evgroup"]);
aggr <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ bgn_date + evgroup, data = tidy[tidy$bgn_date > as.Date("01/11/2006 0:00:00", "%m/%e/%Y %H:%M:%S"), ], FUN=sum, na.rm=TRUE);
aggr2 <- aggregate(damage ~ evgroup, data = aggr, FUN=sum, na.rm=TRUE);
pd1 <- ggplot(worse_damage, aes(x=evgroup, y=damage)) +
ggtitle("Economic Impact (1996 to 2011)") +
xlab("") + ylab("damage Cost") +
theme(text = element_text(size=8),
axis.text.y = element_text(size= 6)) +
geom_bar(stat="identity") + coord_flip()
pc1 <- ggplot(worse_casualties, aes(x=evgroup, y=casualties)) +
xlab("Weather Category Events") + ylab("Casualties (fatalities + injuries)") +
theme(text = element_text(size=8),
axis.text.y = element_text(size= 6)) +
ggtitle("Causing Most Casualties (1996 to 2011)") +
geom_bar(stat="identity") + coord_flip()
pt1 <- qplot(bgn_date, damage, data=aggr, geom="line", group=evgroup, colour=evgroup, position="identity") +
ggtitle("Weather Event Group Damage Cost (11/06 to 11/2011) Across US") +
theme(text = element_text(size=8))
par(mfrow=c(2,2),mar=c(5.1,4.1,4.1,2.1), oma=c(0,0,0,0))
vp.setup(2,2)
print(pt1, vp=vp.layout(1, 1:2))
print(pd1, vp=vp.layout(2,1))
print(pc1, vp=vp.layout(2,2))
From this graphic above we can see that convection represents the worst category from casualties perspective is convection with 35,311. It is 3.58 times of second category extreme temp, but if you consider only the fatalities see that difference decrease to 1.31 times of extreme temp.
From the total damage cost perspective, we can see that tropical cyclones events have the greatest economic consequences with 94.8 billion dollars. This largely due to the catastrophic events of the first half of 2006 as the New Orleans caused by Katrina. Note that this events are really a record. If we only consider the last 5 year, see the first line graph above, we note that convection and flood take first place with 29.5 and 26.3 billion dollars respectively.
Well, the other category appears like a big bag of cats, since it appears with extreme importance to damage cost, it is interesting make other analysis to check the worst weather events from its types.
So, we proced with focus analysis in the weather events types to identify which are the 10th worse ones from a perspective of Damage Cost, Casualties and fatalities. For that, we proced with an aggregation of the data by evtype.
For plotting needs, we create two new datasets, both ordinate decrescent and also reset the factor of evtype according these order.
### Aggregating data by event types
aggr_type <- aggregate(cbind(fatalities, injuries, casualties, damage) ~ evtype, data = tidy, FUN=sum, na.rm=TRUE)
### Getting the top ten Weather Event Types of Casualties and Damage Cost
top_damage <- aggr_type[with(aggr_type, order(-damage)),][1:10,]
top_damage$evtype <-factor(top_damage$evtype, levels=top_damage[order(top_damage$damage), "evtype"])
top_casualties <- aggr_type[with(aggr_type, order(-casualties)),][1:10,]
top_casualties$evtype <-factor(top_casualties$evtype, levels=top_casualties[order(top_casualties$casualties), "evtype"])
top_fatalities <- aggr_type[with(aggr_type, order(-fatalities)),][1:10,]
top_fatalities$evtype <-factor(top_fatalities$evtype, levels=top_fatalities[order(top_fatalities$fatalities), "evtype"])
pd2 <- ggplot(top_damage, aes(x=evtype, y=damage)) +
ggtitle("Having Greatest Economic Impact") +
xlab("Weather Event Type") + ylab("damage Cost") +
theme(text = element_text(size=8),
axis.text.y = element_text(size= 6)) +
geom_bar(stat="identity") + coord_flip()
pc2 <- ggplot(top_casualties, aes(x=evtype, y=casualties)) +
xlab("") + ylab("Casualties (fatalities + injuries)") +
theme(text = element_text(size=8),
axis.text.y = element_text(size= 6)) +
ggtitle("Causing Most Casualties") +
geom_bar(stat="identity") + coord_flip()
pf2 <- ggplot(top_fatalities, aes(x=evtype, y=fatalities)) +
ggtitle("Causing Most Fatalities") +
xlab("") + ylab("Fatalities") +
theme(text = element_text(size=8),
axis.text.y = element_text(size= 6)) +
geom_bar(stat="identity") + coord_flip()
par(mar=c(5.1,4.1,4.1,2.1), oma=c(0,0,0,0))
grid.arrange(pd2, pc2, pf2, nrow = 1, ncol=3)
From this graphic above we can see that tornado represents the worst eventtype from casualties perspective is tornado with 22,183. It is 2.25 times of second event type flood. On the other hand, if you consider only the fatalities, the tornados surge as the second and excessive heat takes the worst place with 18.9% more deaths than tornados.
From the total damage cost perspective, we can see that hurricane events have the greatest economic consequences with 86.5 billion dollars.
From casualties perspective, we can see that:
convention event category (which includes lightning, tornado, thunderstorm wind, hail) is most frequncy and is the worst category from casualties perspective with 35,311. Yet, if we consider only the fatalities, the diference decrease to 1.31 times of extreme temp.
tornado represents the worst event type with 2.25 times of second event type flood. On the other hand, if you consider only the fatalities, the excessive heat takes the worst place with 18.9% more deaths than tornados.
From the total damage cost perspective, we can see that:
tropical cyclones events category have the greatest economic consequences with 1.19 times other, the second category.
hurricane events types have the greatest economic consequences with 1.69 times of second enevt type flood.
So, tornado is the most health harmful, but excessive heat cause the greatest mortalities and hurricane have the greatest economic consequences
The Worst Weather Events across US to the most health harmful is tornado, but excessive heat cause the greatest mortalities and hurricane have the greatest economic consequences considering the record in 2006.