Synopsis

In this analysis we consider the effects of harmful meteorological events for communities and municipalities.
Based on the collected data from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, we need to understand which of these events are the most destructive and malicious. We also estimate the probability of occurrence of these events, based on the frequency of their occurrence in the past. The ultimate goal of this research is to prepare for severe weather events and to prioritize resources for different types of events.
Based on this analysis, we can conclude that excessive heat causes the most fatalities, tornadoes the most injuries, drought has the most damages in crops and hurricanes/typhoons has the greatest economic impacts.
Then we present polar plots to demonstrate the frequencies of the events with the most impact during the year.

Prepare the enviroment

Throughout this analysis you can always find the code that I used to generate the output. When writing code chunks in the R markdown document,I always used echo = TRUE so that everyone to be able to read the code.
Then we load the libraries used in our analysis:

library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
opts_chunk$set(echo = TRUE)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
## Loading required package: grid
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.1.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Next, we show our system environment, maybe it is useful for reproducible proposals:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## locale:
## [1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253   
## [3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C                 
## [5] LC_TIME=Greek_Greece.1253    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.8    evaluate_0.7    formatR_1.2     htmltools_0.2.6
##  [5] knitr_1.10.5    magrittr_1.5    rmarkdown_0.6.1 stringi_0.4-1  
##  [9] stringr_1.0.0   tools_3.1.2     yaml_2.1.13

Data Proceessing

Data

The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file here: Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Load Data

First, we need a raw data for work. We check if the file is already in the working directory. If not, we download the file and load the data and assign it to the “data” data frame. We set NA’s to be values with: “NA” or “”.

fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if(!file.exists("StormData.csv.bz2")) {
        download.file(fileURL, "./StormData.csv.bz2", method = "curl")
        }

data <- read.csv(bzfile("StormData.csv.bz2"),
                 header = TRUE,  na.strings = c("NA", ""))

Data Exploration

Now, we can proceed with the data pre-examination, first we examine the structure of the data set.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We see that there are a lot of variables, so we will need to narrow them down.
Then we look into EVTYPE variable that has the event type of each observation.

length(unique(data$EVTYPE))
## [1] 985

As we can see above there are 985 unique event types in the raw data set.
In National Weather Service Storm Data Documentation paragraphs 7.1 through 7.48 there are 48 event types, so later we will group them to achieve this.

Data Preparation

First, we add a new column named “YEAR” to specify the year of each observation.

data$YEAR <- as.numeric(format(as.Date(data$BGN_DATE,format="%m/%d/%Y %H:%M:%S"),"%Y"))

Then we select the columns needed for our analysis and subset our data set. These columns are:
“BGN_DATE”, “STATE”, “COUNTYNAME”, “EVTYPE”, “FATALITIES”, “INJURIES”, “CROPDMG”, “CROPDMGEXP”, “PROPDMG”, “PROPDMGEXP”, “YEAR”

cols <- c("BGN_DATE","STATE","COUNTYNAME", "EVTYPE", "FATALITIES", "INJURIES",
          "CROPDMG", "CROPDMGEXP", "PROPDMG", "PROPDMGEXP", "YEAR")
   
data <- subset(data, select = cols)

Next, we remove unuseful events from the data set to reduce the date size for easier handel, the reason for this is if an event had no fatalities and no injured and no crop or property damage, than for the purpose of this analysis it is useless. We name this new data set “tidy”.

tidy <- subset(data, FATALITIES > 0 | INJURIES > 0 | CROPDMG > 0 | PROPDMG > 0)

As mentioned above, more recent years should be considered more complete. As we can see here, after 1996 data is more complete as all 48 types are recorded. So we calculate the percentage of observations after 1996.

per96 <- round(nrow(tidy[tidy$YEAR >=1996,]) / nrow(tidy) * 100, 2)
per96
## [1] 79.06

As 79.06 % of the observations are after 1996, we subset our data set for the years after 1996 and we create a new data set called “clean.1996”. Then we check EVTYPE again to see if it has changed. We also remove the previous data sets (“data”, “tidy”) to save memory.

clean.1996 <- tidy[tidy$YEAR >= 1996,]

rm(data, tidy)

unique(clean.1996$EVTYPE)
##   [1] WINTER STORM              TORNADO                  
##   [3] TSTM WIND                 HIGH WIND                
##   [5] FLASH FLOOD               FREEZING RAIN            
##   [7] EXTREME COLD              LIGHTNING                
##   [9] HAIL                      FLOOD                    
##  [11] TSTM WIND/HAIL            EXCESSIVE HEAT           
##  [13] RIP CURRENTS              Other                    
##  [15] HEAVY SNOW                WILD/FOREST FIRE         
##  [17] ICE STORM                 BLIZZARD                 
##  [19] STORM SURGE               Ice jam flood (minor     
##  [21] DUST STORM                STRONG WIND              
##  [23] DUST DEVIL                Tstm Wind                
##  [25] URBAN/SML STREAM FLD      FOG                      
##  [27] ROUGH SURF                Heavy Surf               
##  [29] Dust Devil                HEAVY RAIN               
##  [31] Marine Accident           AVALANCHE                
##  [33] Freeze                    DRY MICROBURST           
##  [35] Strong Wind               WINDS                    
##  [37] COASTAL STORM             Erosion/Cstl Flood       
##  [39] River Flooding            WATERSPOUT               
##  [41] DAMAGING FREEZE           Damaging Freeze          
##  [43] HURRICANE                 TROPICAL STORM           
##  [45] Beach Erosion             High Surf                
##  [47] Heavy Rain/High Surf      Unseasonable Cold        
##  [49] Early Frost               Wintry Mix               
##  [51] Extreme Cold              DROUGHT                  
##  [53] Coastal Flooding          Torrential Rainfall      
##  [55] Landslump                 Hurricane Edouard        
##  [57] Coastal Storm             TIDAL FLOODING           
##  [59] Tidal Flooding            Strong Winds             
##  [61] EXTREME WINDCHILL         Glaze                    
##  [63] Extended Cold             Whirlwind                
##  [65] Heavy snow shower         Light snow               
##  [67] COASTAL FLOOD             Light Snow               
##  [69] MIXED PRECIP              COLD                     
##  [71] Freezing Spray            DOWNBURST                
##  [73] Mudslides                 Microburst               
##  [75] Mudslide                  Cold                     
##  [77] SNOW                      Coastal Flood            
##  [79] Snow Squalls              Wind Damage              
##  [81] Light Snowfall            Freezing Drizzle         
##  [83] Gusty wind/rain           GUSTY WIND/HVY RAIN      
##  [85] Wind                      Cold Temperature         
##  [87] Heat Wave                 Snow                     
##  [89] COLD AND SNOW             HEAVY SURF               
##  [91] RAIN/SNOW                 WIND                     
##  [93] FREEZE                    TSTM WIND (G45)          
##  [95] Gusty Winds               GUSTY WIND               
##  [97] TSTM WIND 40              TSTM WIND 45             
##  [99] HARD FREEZE               TSTM WIND (41)           
## [101] HEAT                      RIVER FLOOD              
## [103] TSTM WIND (G40)           RIP CURRENT              
## [105] HIGH SURF                 MUD SLIDE                
## [107] Frost/Freeze              SNOW AND ICE             
## [109] COASTAL FLOODING          AGRICULTURAL FREEZE      
## [111] WINTER WEATHER            STRONG WINDS             
## [113] SNOW SQUALL               ICY ROADS                
## [115] OTHER                     THUNDERSTORM             
## [117] Hypothermia/Exposure      HYPOTHERMIA/EXPOSURE     
## [119] Lake Effect Snow          Freezing Rain            
## [121] Mixed Precipitation       BLACK ICE                
## [123] COASTALSTORM              LIGHT SNOW               
## [125] DAM BREAK                 Gusty winds              
## [127] blowing snow              FREEZING DRIZZLE         
## [129] FROST                     GRADIENT WIND            
## [131] UNSEASONABLY COLD         GUSTY WINDS              
## [133] TSTM WIND AND LIGHTNING   gradient wind            
## [135] Gradient wind             Freezing drizzle         
## [137] WET MICROBURST            Heavy surf and wind      
## [139] FUNNEL CLOUD              TYPHOON                  
## [141] LANDSLIDES                HIGH SWELLS              
## [143] HIGH WINDS                SMALL HAIL               
## [145] UNSEASONAL RAIN           COASTAL FLOODING/EROSION 
## [147]  TSTM WIND (G45)          TSTM WIND  (G45)         
## [149] HIGH WIND (G40)           TSTM WIND (G35)          
## [151] GLAZE                     COASTAL EROSION          
## [153] UNSEASONABLY WARM         SEICHE                   
## [155] COASTAL  FLOODING/EROSION HYPERTHERMIA/EXPOSURE    
## [157] WINTRY MIX                RIVER FLOODING           
## [159] ROCK SLIDE                GUSTY WIND/HAIL          
## [161] HEAVY SEAS                 TSTM WIND               
## [163] LANDSPOUT                 RECORD HEAT              
## [165] EXCESSIVE SNOW            LAKE EFFECT SNOW         
## [167] FLOOD/FLASH/FLOOD         MIXED PRECIPITATION      
## [169] WIND AND WAVE             FLASH FLOOD/FLOOD        
## [171] LIGHT FREEZING RAIN       ICE ROADS                
## [173] HIGH SEAS                 RAIN                     
## [175] ROUGH SEAS                TSTM WIND G45            
## [177] NON-SEVERE WIND DAMAGE    WARM WEATHER             
## [179] THUNDERSTORM WIND (G40)   LANDSLIDE                
## [181] HIGH WATER                 FLASH FLOOD             
## [183] LATE SEASON SNOW          WINTER WEATHER MIX       
## [185] ROGUE WAVE                FALLING SNOW/ICE         
## [187] NON-TSTM WIND             NON TSTM WIND            
## [189] MUDSLIDE                  BRUSH FIRE               
## [191] BLOWING DUST              VOLCANIC ASH             
## [193]    HIGH SURF ADVISORY     HAZARDOUS SURF           
## [195] WILDFIRE                  COLD WEATHER             
## [197] WHIRLWIND                 ICE ON ROAD              
## [199] SNOW SQUALLS              DROWNING                 
## [201] EXTREME COLD/WIND CHILL   MARINE TSTM WIND         
## [203] HURRICANE/TYPHOON         DENSE FOG                
## [205] WINTER WEATHER/MIX        FROST/FREEZE             
## [207] ASTRONOMICAL HIGH TIDE    HEAVY SURF/HIGH SURF     
## [209] TROPICAL DEPRESSION       LAKE-EFFECT SNOW         
## [211] MARINE HIGH WIND          THUNDERSTORM WIND        
## [213] TSUNAMI                   STORM SURGE/TIDE         
## [215] COLD/WIND CHILL           LAKESHORE FLOOD          
## [217] MARINE THUNDERSTORM WIND  MARINE STRONG WIND       
## [219] ASTRONOMICAL LOW TIDE     DENSE SMOKE              
## [221] MARINE HAIL               FREEZING FOG             
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

Clean EVTYPE Variable

As we can see, there are 222 event types, so now we will group them. To achieve this we used the official site Storm Events Database. We took every event type shown above and we used the state, the county name and the date to search and compare the search results with the one from the data set in order to group them in 48 event types as shown in National Weather Service Storm Data Documentation paragraphs 7.1 through 7.48.
Here is an example:
First we check the observation with the specific event type (“ROCK SLIDE”).

clean.1996[which(clean.1996$EVTYPE == "ROCK SLIDE"),]
##                 BGN_DATE STATE COUNTYNAME     EVTYPE FATALITIES INJURIES
## 345192 3/22/1998 0:00:00    VA     VAZ057 ROCK SLIDE          0        0
##        CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP YEAR
## 345192       0       <NA>     150          K 1998

Then we use the state, the county name and the date to search and compare the search results with the one from the data set as seen above. So we group this observation with “DEBRIS FLOW” event type category.

So after a very very long research, we grouped “EVTYPE” variable using these commands.

# Replace 'and' or '&' with space
clean.1996$EVTYPE <- gsub("(\\sAND(\\s|$))|(\\s&\\s)", " ", clean.1996$EVTYPE)

# Remove leading and trailing space
clean.1996$EVTYPE <- gsub("^\\s+|\\s+$", "", clean.1996$EVTYPE)

# Remove multiple spaces
clean.1996$EVTYPE <- gsub("\\s{2,}", " ", clean.1996$EVTYPE)

# uppercase all
clean.1996$EVTYPE <- toupper(clean.1996[, "EVTYPE"])

# rename and group EVTYPES
clean.1996$EVTYPE <- gsub("GUSTY ", "", clean.1996$EVTYPE)
clean.1996$EVTYPE <- gsub("TSTM", "THUNDERSTORM", clean.1996$EVTYPE)
clean.1996[grepl("COASTAL", clean.1996$EVTYPE)|
               grepl("EROSION", clean.1996$EVTYPE)|
               grepl("TIDAL", clean.1996$EVTYPE), c("EVTYPE")] <- "COASTAL FLOOD"
clean.1996[grepl("MUDSLIDES", clean.1996$EVTYPE)|
               grepl("LANDSL", clean.1996$EVTYPE)|
               grepl("ROCK SLIDE", clean.1996$EVTYPE), c("EVTYPE")] <- "DEBRIS FLOW"
clean.1996[grepl("FOG", clean.1996$EVTYPE), c("EVTYPE")] <- "DENSE FOG"
clean.1996[grepl("WHIRLWIND", clean.1996$EVTYPE)|
               grepl("LANDSPOUT", clean.1996$EVTYPE), c("EVTYPE")] <- "DUST DEVIL"
clean.1996[grepl("BLOWING DUST", clean.1996$EVTYPE), c("EVTYPE")] <- "DUST STORM"
clean.1996[grepl("WARM WEATHER", clean.1996$EVTYPE), c("EVTYPE")] <- "EXCESSIVE HEAT"
clean.1996[grepl("EXTREME", clean.1996$EVTYPE), c("EVTYPE")] <- "EXTREME COLD/WIND CHILL"
clean.1996[grepl("FLASH", clean.1996$EVTYPE), c("EVTYPE")] <- "FLASH FLOOD"
clean.1996[grepl("FROST", clean.1996$EVTYPE) |
               grepl("FREEZE", clean.1996$EVTYPE), c("EVTYPE")] <- "FROST/FREEZE"
clean.1996[grepl("SMALL HAIL", clean.1996$EVTYPE), c("EVTYPE")] <- "HAIL"
clean.1996[grepl("FREEZING SPRAY", clean.1996$EVTYPE), c("EVTYPE")] <- "FREEZING FOG"
clean.1996[grepl("HIGH WATER", clean.1996$EVTYPE) |
               grepl("ICE JAM", clean.1996$EVTYPE) |
               grepl("DAM BREAK", clean.1996$EVTYPE) |
               grepl("DROWNING", clean.1996$EVTYPE) |
               grepl("RIVER", clean.1996$EVTYPE) |
               grepl("URBAN", clean.1996$EVTYPE), c("EVTYPE")] <- "FLOOD"
clean.1996[grepl("HEAT WAVE", clean.1996$EVTYPE) |
               grepl("HYPERTHERMIA", clean.1996$EVTYPE) |
               grepl("RECORD HEAT", clean.1996$EVTYPE) |
               grepl("WARM", clean.1996$EVTYPE), c("EVTYPE")] <- "HEAT"
clean.1996[grepl("HEAT WAVE", clean.1996$EVTYPE) |
               grepl("HEAVY SNOW SHOWER", clean.1996$EVTYPE) |
               grepl("EXCESSIVE SNOW", clean.1996$EVTYPE) |
               grepl("LATE SEASON SNOW", clean.1996$EVTYPE), c("EVTYPE")] <- "HEAVY SNOW"
clean.1996[grepl("HEAVY R", clean.1996$EVTYPE) |
               grepl("HIGH WINDS", clean.1996$EVTYPE) |
               grepl("MUD", clean.1996$EVTYPE) |
               grepl("OTHER", clean.1996$EVTYPE) |
               grepl("^RAIN$", clean.1996$EVTYPE) |
               grepl("RAINFALL", clean.1996$EVTYPE) |
               grepl("HVY RAIN", clean.1996$EVTYPE) |
               grepl("NAL RAIN", clean.1996$EVTYPE), c("EVTYPE")] <- "HEAVY RAIN"
clean.1996[grepl("SURF", clean.1996$EVTYPE) |
               grepl("SWELLS", clean.1996$EVTYPE) |
               grepl("ROGUE WAVE", clean.1996$EVTYPE) |
               grepl("SEAS$", clean.1996$EVTYPE), c("EVTYPE")] <- "HIGH SURF"
clean.1996[grepl("HURRICANE", clean.1996$EVTYPE) |
               grepl("TYPHOON", clean.1996$EVTYPE), c("EVTYPE")] <- "HURRICANE/TYPHOON"
clean.1996[grepl("FALLING SNOW", clean.1996$EVTYPE), c("EVTYPE")] <- "ICE STORM"
clean.1996[grepl("EFFECT", clean.1996$EVTYPE), c("EVTYPE")] <- "LAKE-EFFECT SNOW"
clean.1996[grepl("MARINE ACCIDENT", clean.1996$EVTYPE), c("EVTYPE")] <- "MARINE HIGH WIND"
clean.1996[grepl("RIP CURRENTS", clean.1996$EVTYPE), c("EVTYPE")] <- "RIP CURRENT"
clean.1996[grepl("HIGH TIDE", clean.1996$EVTYPE) |
               grepl("SURGE", clean.1996$EVTYPE), c("EVTYPE")] <- "STORM SURGE/TIDE"
clean.1996[grepl("GRADIENT WIND", clean.1996$EVTYPE) |
               grepl("NON", clean.1996$EVTYPE) |
               grepl("WINDS", clean.1996$EVTYPE) |
               grepl("WIND/RAIN", clean.1996$EVTYPE), c("EVTYPE")] <- "STRONG WIND"
clean.1996[grepl("GRADIENT WIND", clean.1996$EVTYPE) |
               grepl("^THUNDERSTORM", clean.1996$EVTYPE) |
               grepl("BURST", clean.1996$EVTYPE) |
               grepl("WIND/HAIL", clean.1996$EVTYPE), c("EVTYPE")] <- "THUNDERSTORM WIND"
clean.1996[grepl("FIRE", clean.1996$EVTYPE), c("EVTYPE")] <- "WILDFIRE"
clean.1996[grepl("^SNOW", clean.1996$EVTYPE) |
               grepl("BLOWING SNOW", clean.1996$EVTYPE) |
               grepl("ING RAIN", clean.1996$EVTYPE) |
               grepl("ROAD", clean.1996$EVTYPE) |
               grepl("BLACK ICE", clean.1996$EVTYPE) |
               grepl("COLD SNOW", clean.1996$EVTYPE) |
               grepl("MIX", clean.1996$EVTYPE) |
               grepl("GLAZE", clean.1996$EVTYPE) |
               grepl("LIGHT SNOW", clean.1996$EVTYPE) | 
               grepl("GLAZE", clean.1996$EVTYPE) |
               grepl("RAIN/SNOW", clean.1996$EVTYPE) |
               grepl("DRIZZLE", clean.1996$EVTYPE), c("EVTYPE")] <- "WINTER WEATHER"
clean.1996[grepl("^COLD", clean.1996$EVTYPE) |
               grepl("COLD$", clean.1996$EVTYPE) |
               grepl("HYPOTHERMIA", clean.1996$EVTYPE), c("EVTYPE")] <- "COLD/WIND CHILL"
clean.1996[grepl("^HIGH WIND", clean.1996$EVTYPE) |
               grepl("^WIND", clean.1996$EVTYPE) |
               grepl("HYPOTHERMIA", clean.1996$EVTYPE), c("EVTYPE")] <- "HIGH WIND"

# Print ordered unique(clean.1996$EVTYPE) to see result
o <- unique(clean.1996$EVTYPE)
o <- o[order(o)]
o
##  [1] "ASTRONOMICAL LOW TIDE"    "AVALANCHE"               
##  [3] "BLIZZARD"                 "COASTAL FLOOD"           
##  [5] "COLD/WIND CHILL"          "DEBRIS FLOW"             
##  [7] "DENSE FOG"                "DENSE SMOKE"             
##  [9] "DROUGHT"                  "DUST DEVIL"              
## [11] "DUST STORM"               "EXCESSIVE HEAT"          
## [13] "EXTREME COLD/WIND CHILL"  "FLASH FLOOD"             
## [15] "FLOOD"                    "FREEZING FOG"            
## [17] "FROST/FREEZE"             "FUNNEL CLOUD"            
## [19] "HAIL"                     "HEAT"                    
## [21] "HEAVY RAIN"               "HEAVY SNOW"              
## [23] "HIGH SURF"                "HIGH WIND"               
## [25] "HURRICANE/TYPHOON"        "ICE STORM"               
## [27] "LAKE-EFFECT SNOW"         "LAKESHORE FLOOD"         
## [29] "LIGHTNING"                "MARINE HAIL"             
## [31] "MARINE HIGH WIND"         "MARINE STRONG WIND"      
## [33] "MARINE THUNDERSTORM WIND" "RIP CURRENT"             
## [35] "SEICHE"                   "STORM SURGE/TIDE"        
## [37] "STRONG WIND"              "THUNDERSTORM WIND"       
## [39] "TORNADO"                  "TROPICAL DEPRESSION"     
## [41] "TROPICAL STORM"           "TSUNAMI"                 
## [43] "VOLCANIC ASH"             "WATERSPOUT"              
## [45] "WILDFIRE"                 "WINTER STORM"            
## [47] "WINTER WEATHER"

During this research, we found a strange data observation.

clean.1996[grepl("CA", clean.1996$STATE) & grepl("B", clean.1996$PROPDMGEXP) & grepl("FLOOD", clean.1996$EVTYPE),]
##                BGN_DATE STATE COUNTYNAME EVTYPE FATALITIES INJURIES
## 605953 1/1/2006 0:00:00    CA       NAPA  FLOOD          0        0
##        CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP YEAR
## 605953    32.5          M     115          B 2006

This means that property damage for this observation is 115 Billion Dollars. So we checked it and corrected it in Million Dollars.

clean.1996[grepl("CA", clean.1996$STATE) & grepl("B", clean.1996$PROPDMGEXP) & grepl("FLOOD", clean.1996$EVTYPE), c("PROPDMGEXP")]<-"M"
Calculate Crop and Property Damage

Then, we made the table of CROPDMGEXP variable in order to examine it.

table(clean.1996$CROPDMGEXP, useNA = "ifany")
## 
##      ?      0      2      B      k      K      m      M   <NA> 
##      0      0      0      2      0  96787      0   1762 102767
a<-clean.1996[is.na(clean.1996$CROPDMGEXP),]
table(a$CROPDMG, useNA = "ifany")
## 
##      0 
## 102767

We see that, when CROPDMGEXP variable is “NA” crop damage is zero.
Then, we made the table of PROPDMGEXP variable in order to examine it.

table(clean.1996$PROPDMGEXP, useNA = "ifany")
## 
##      -      ?      +      0      1      2      3      4      5      6 
##      0      0      0      0      0      0      0      0      0      0 
##      7      8      B      h      H      K      m      M   <NA> 
##      0      0     31      0      0 185474      0   7365   8448
a<-clean.1996[is.na(clean.1996$PROPDMGEXP),]
table(a$PROPDMG, useNA = "ifany")
## 
##    0 
## 8448

We see that, when PROPDMGEXP variable is “NA” crop damage is zero.
We also see that exponents are: “B”, “M”, “K” for “Billions”, “Millions”, “Thousands”. So, in order to calculate crop and property damage we used a function to translate those exponents. Then we converted damage to millions of dollars (M). The result is saved in two new variables, “CROPDAMAGE” and “PROPDAMAGE”. We also created a variable “TOTAL” to calculate the total damage of an event type.

# Function to calculate damage in dollars
use_exponent <- function(varexp) {
        if (is.na(varexp))  exponent <- 1 
        else {
            exp <- toupper(varexp);
            if (exp == "K") { exponent <- 1000 }
            else if (exp == "M") { exponent <- 1000000 }
            else if (exp == "B") { exponent <- 1000000000 }
            }
        exponent
        }

# Convert damage to millions (M)
clean.1996$CROPDAMAGE <- round(with(clean.1996,
                                    CROPDMG * sapply(CROPDMGEXP, 
                                                     use_exponent)/1000000), 3)
clean.1996$PROPDAMAGE <- round(with(clean.1996,
                                    PROPDMG * sapply(PROPDMGEXP, 
                                                     use_exponent)/1000000), 3)

clean.1996$TOTAL <- clean.1996$CROPDAMAGE + clean.1996$PROPDAMAGE

Results

First question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Aggregate Fatalities and Injuries

# aggregate fatalities
fatalities <- aggregate(FATALITIES ~ EVTYPE, clean.1996, sum)
fatalities <- fatalities[order(-fatalities$FATALITIES), ]
fatalities$EVTYPE <- factor(fatalities$EVTYPE,
                            levels = fatalities[order(fatalities$FATALITIES),
                                                "EVTYPE"])
# aggregate injuries
injuries <- aggregate(INJURIES ~ EVTYPE, clean.1996, sum)
injuries <- injuries[order(-injuries$INJURIES), ]
injuries$EVTYPE <- factor(injuries$EVTYPE,
                            levels = injuries[order(injuries$INJURIES),
                                                "EVTYPE"])

Table of ten most harmful event types with respect to population health

After that, we make a table of ten most impactful in public health events

library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
table.pub.health <- merge(fatalities, injuries, 
                            by = 'EVTYPE', all.x = TRUE, all.y = TRUE)
table.pub.health <- format(table.pub.health, big.mark=",", decimal.mark=".")  
table.pub.health <- table.pub.health[order(table.pub.health$FATALITIES, 
                                           decreasing=TRUE), ]
kable(table.pub.health[1:10,], format = "pandoc",
      caption = "Events with greatest consequences in public health", 
      row.names = FALSE,
      col.names=c("Event Type", "Fatalities", "Injuries"), 
      align=c("l", "r", "r"))
Events with greatest consequences in public health
Event Type Fatalities Injuries
EXCESSIVE HEAT 1,797 6,393
TORNADO 1,511 20,667
FLASH FLOOD 887 1,674
LIGHTNING 651 4,141
RIP CURRENT 542 503
FLOOD 448 6,838
THUNDERSTORM WIND 382 5,154
EXTREME COLD/WIND CHILL 257 108
HIGH WIND 254 1,168
HEAT 240 1,309

Fatalities and Injuries Plot

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
## Loading required package: grid
# fatalities plot
fat.plot <- ggplot(fatalities[1:10,], aes(x = EVTYPE, y = FATALITIES)) + 
    theme(text = element_text(size=32, face = "bold")) +
    xlab("Event Type") + ylab("Fatalities") + geom_bar(stat="identity", fill = "red") +
    geom_text(aes(label = FATALITIES), hjust=1, size=9) + coord_flip() +
    theme(plot.margin = unit(c(2, 2, 2, 0), "lines"))

# injuries plot
inj.plot <- ggplot(injuries[1:10,], aes(x = EVTYPE, y = INJURIES)) + 
    theme(text = element_text(size=32, face = "bold")) +
    xlab("") + ylab("Injuries") + geom_bar(stat="identity", fill = "light blue") +
    geom_text(aes(label = INJURIES), hjust=1, size=9) + coord_flip()+
    theme(plot.margin = unit(c(2, 0, 2, 2), "lines"))

grid.arrange(fat.plot, inj.plot, ncol = 2,
            main=textGrob("Fatalities and Injuries by Event", 
                          gp=gpar(fontsize=50, col="blue")))
Figure 1: Fatalities and Injuries by Event Across US

The table and the plots shows that Excessive Heat causes the most fatalities (1,797).
Furthermore, Tornadoes causes the most injuries (20,667) and the second most fatalities (1,511).

Second question: Across the United States, which types of events have the greatest economic consequences?

Aggregate Damage (Crop and Property Damage)

# aggregate cropdamage
cropdamage <- aggregate(CROPDAMAGE ~ EVTYPE, clean.1996, sum)
cropdamage$CROPDAMAGE <- round(cropdamage$CROPDAMAGE, 2) 
cropdamage <- cropdamage[order(-cropdamage$CROPDAMAGE), ]
cropdamage$EVTYPE <- factor(cropdamage$EVTYPE,
                            levels = cropdamage[order(cropdamage$CROPDAMAGE),
                                                "EVTYPE"])
# aggregate propdamage
propdamage <- aggregate(PROPDAMAGE ~ EVTYPE, clean.1996, sum)
propdamage$PROPDAMAGE <- round(propdamage$PROPDAMAGE, 2) 
propdamage <- propdamage[order(-propdamage$PROPDAMAGE), ]
propdamage$EVTYPE <- factor(propdamage$EVTYPE,
                            levels = propdamage[order(propdamage$PROPDAMAGE),
                                                "EVTYPE"])
# aggregate total damage
damage <- aggregate(TOTAL ~ EVTYPE, clean.1996, sum)
damage$TOTAL <- round(damage$TOTAL, 2) 
damage <- damage[order(-damage$TOTAL), ]
damage$EVTYPE <- factor(damage$EVTYPE,
                            levels = damage[order(damage$TOTAL),
                                                "EVTYPE"])

Then we make some functions to simplify plotting multiple plots in one figure.

# define function to create multi-plot setup (nrow, ncol)
vp.setup <- function(x,y){
    # create a new layout with grid
    grid.newpage()
    
    # define viewports and assign it to grid layout
    pushViewport(viewport(layout = grid.layout(x,y)))
    }

# define function to easily access layout (row, col)
vp.layout <- function(x,y){
    viewport(layout.pos.row=x, layout.pos.col=y)
    }

Table of events types have the greatest economic consequences

After that, we make a table of ten most impactful in public health events

table_damage <- merge(merge(cropdamage, propdamage, 
                            by = 'EVTYPE', all.x = TRUE, all.y = TRUE),
                      damage, by = 'EVTYPE', all.x = TRUE, all.y = TRUE) 
table_damage <- format(table_damage, big.mark=",", decimal.mark=".")  
table_damage <- table_damage[order(table_damage$TOTAL, decreasing=TRUE), ]
kable(table_damage[1:10,], format = "pandoc",
      caption = "Events with greatest economic consequences (Millions $)", 
      row.names = FALSE,
      col.names=c("Event Type", "Property Damages", "Crop Damages", "Total"), 
      align=c("l", "r", "r", "r"))
Events with greatest economic consequences (Millions $)
Event Type Property Damages Crop Damages Total
HURRICANE/TYPHOON 5,350.11 81,718.89 87,069.00
STORM SURGE/TIDE 0.86 47,844.15 47,845.01
FLOOD 5,013.16 29,245.54 34,258.70
TORNADO 283.42 24,616.90 24,900.32
HAIL 2,496.82 14,595.12 17,091.94
FLASH FLOOD 1,334.90 15,222.20 16,557.10
DROUGHT 13,367.57 1,046.10 14,413.67
THUNDERSTORM WIND 1,016.95 7,913.86 8,930.81
TROPICAL STORM 677.71 7,642.47 8,320.18
WILDFIRE 402.26 7,760.45 8,162.70

Crop, Property and Total Damage

crop.plot <- ggplot(cropdamage[1:10,], aes(x = EVTYPE, y = CROPDAMAGE)) + 
    theme(text = element_text(size=50, face = "bold")) +
    xlab("Event Type") + ylab("Crop Damage (Million $)") + 
    geom_bar(stat="identity", fill = "green") +
    geom_text(aes(label = CROPDAMAGE), hjust=1, size=15) + coord_flip() +
    theme(plot.margin = unit(c(2, 0, 2, 0), "lines"))


prop.plot <- ggplot(propdamage[1:10,], aes(x = EVTYPE, y = PROPDAMAGE)) + 
    theme(text = element_text(size=52, face = "bold")) +
    xlab("") + ylab("Property Damage (Million $)") + 
    geom_bar(stat="identity", fill = "orange") +
    geom_text(aes(label = PROPDAMAGE), hjust=1, size=15) + coord_flip() +
    theme(plot.margin = unit(c(2, 0, 2, 0), "lines"))

total.plot <- ggplot(damage[1:10,], aes(x = EVTYPE, y = TOTAL)) + 
    theme(text = element_text(size=54, face = "bold")) +
    xlab("Event Type") + ylab("Total Damage (Million $)") + 
    geom_bar(stat="identity", fill = "royalblue3") +
    geom_text(aes(label = TOTAL), hjust=1, size=15) + coord_flip() +
    theme(plot.margin = unit(c(2, 0, 2, 0), "lines"))


par(mfrow=c(2,2))

vp.setup(2,2)

print(crop.plot, vp=vp.layout(1, 1))
print(prop.plot, vp=vp.layout(1, 2))
print(total.plot, vp=vp.layout(2, 1:2))
Figure 2: Crop, Property and Total Damage by Event Across US

The table and the plots shows that Drought causes the most crop damage (13,367.57 Million Dollars).
Furthermore, Hurricane/Typhoon causes the most property damage (81,718.89 Million Dollars) and the most total damage (87,069 Million Dollars).

Frequency of Most Harmful Event Types During the Year

Finally, we present the frequency for the most harmful event types during the year in order to observe when and how often these event types occur.
These event types, based on the tables and plots above, are: Drought, Excessive Heat, Flash Flood, Flood, Hail, Hurricane/Typhoon, Lightning, Storm Surge/Tide and Tornado.

library(plotrix)
## Warning: package 'plotrix' was built under R version 3.1.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# function for evtype freq
evtype.freq <- function(evtype){
    ans <- subset(clean.1996, clean.1996$EVTYPE == evtype, select = c("BGN_DATE"))
    ans$MONTH <- as.numeric(format(as.Date(ans$BGN_DATE, 
                                           format="%m/%d/%Y %H:%M:%S"),"%m"))
    ans <- count(ans, MONTH)
    if (length(ans$n) == 12){
        return(ans$n)
    } else {
        temp <-numeric(length = 12)
        for (i in 1:length(ans$n)) {
            temp[ans$MONTH[i]] <- ans$n[i]
        }
    }
    return(temp)
}

# calculation of frequencies for each event type
drought <- evtype.freq("DROUGHT")
excessive.heat <- evtype.freq("EXCESSIVE HEAT")
flash.flood <- evtype.freq("FLASH FLOOD")
flood <- evtype.freq("FLOOD")
hail <- evtype.freq("HAIL")
hurricane.typhoon <- evtype.freq("HURRICANE/TYPHOON")
lightning <- evtype.freq("LIGHTNING")
storm.surge <- evtype.freq("STORM SURGE/TIDE")
tornado <- evtype.freq("TORNADO")

# function for plotting polar plot
polar.month.evtype <- function(evtype, name){
    testpos<-seq(0,350, by = 30)
    labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    ans <- polar.plot(evtype, testpos, labels, testpos,
                      main = paste("Month Polar Plot of", name),
                      radial.lim=c(0,max(evtype)), rp.type= "p",
                      start=90, clockwise=TRUE, lwd=3, line.col = "red",
                      show.grid.labels = TRUE, boxed.radial = FALSE)
}

# make 3x3 grid
par(mfrow=c(3,3))

# plot each event types polar plot
drought.plot <- polar.month.evtype(drought, "Drought")
excessive.heat.plot <- polar.month.evtype(excessive.heat, "Excessive Heat")
flash.flood.plot <- polar.month.evtype(flash.flood, "Flash Flood")
flood.plot <- polar.month.evtype(flood, "Flood")
hail.plot <- polar.month.evtype(hail, "Hail")
hurricane.typhoon.plot <- polar.month.evtype(hurricane.typhoon, "Hurricane/Typhoon")
lightning.plot <- polar.month.evtype(lightning, "Lightning")
storm.surge.plot <- polar.month.evtype(storm.surge, "Storm Surge/Tide")
tornado.plot <- polar.month.evtype(tornado, "Tornado")
Figure 3: Frequency of Most Harmful Event Types During the Year