Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

In this report we aim to answer two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

To answer these questions we use weather event data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. From these data, we found that
1. Tornadoes caused the most injuries (91,364), followed by thunderstorm wind, flood, excessive heat and lightning.
2. Floods caused the most damage (USD 161.07B), followed by hurricane (typhoon), tornado, storm surge/tide and hail.

Loading the Raw Data

We obtained the data from

https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

        fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileurl,"full_data.csv.bz2")
        full_data <- read.csv("full_data.csv.bz2", header = TRUE)
        
        dim(full_data)
## [1] 902297     37
        str(full_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
        head(full_data,3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
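
The download step above fetches the file again on every run. A minimal sketch of a guard that skips the download when the file is already on disk is shown here; `fetch_once` is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper (not in the original report): download `url` to `dest`
# only if `dest` is not already on disk. Returns TRUE when a download happened
# and FALSE when the existing file was reused.
fetch_once <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  download.file(url, dest, mode = "wb")
  TRUE
}

# Usage with the report's URL and file name (commented out, needs network):
# fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# fetch_once(fileurl, "full_data.csv.bz2")
```

With this guard, re-knitting the report would reuse the local copy instead of downloading the compressed file again.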

Processing the data

The columns required for the analysis are subset from the full data. Column names are converted to lower case for ease of use, evtype and remarks are converted to character for easier handling, and bgn_date is converted to Date format.

work_data <- full_data[,c(2,8,23:28,36)]
names(work_data) <- tolower(names(work_data))
work_data$evtype <- tolower(as.character(work_data$evtype))
work_data$bgn_date <- as.Date(as.character(work_data$bgn_date),format = "%m/%d/%Y")
work_data$remarks <- as.character(work_data$remarks)
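
As a small illustration of the date conversion, the sketch below applies the same format string to two BGN_DATE values copied from the head() output above. The variable names `raw_dates` and `parsed` are made up for this example.

```r
# Two BGN_DATE strings as they appear in the raw data (date plus a time part).
raw_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00")

# The format describes only month/day/year, so the trailing time part
# is not parsed and only the calendar date is kept.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
format(parsed)
```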

According to the documentation (page 6 of https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf), there are 48 official event types, whereas the data lists 898 unique event types. The list from the documentation was copied to a text file, “Event_list.txt”, which is read into the R data frame event_list. The designator appended to the end of each event name is then removed using substr.

        event_list <- read.delim("Event_list.txt",header = FALSE)
        names(event_list) <- c("event")
        event_list$event <- tolower(as.character(event_list$event))
        event_list$event <- substr(event_list$event,1,nchar(event_list$event)-2)
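
The designator-stripping step can be sketched on a few sample entries. The layout assumed here (event name, a space, then a one-letter designator) is an assumption about the contents of “Event_list.txt”, which is not reproduced in this report, and `raw_events` is a made-up name.

```r
# Assumed layout of the copied list: "<event name> <one-letter designator>".
raw_events <- c("Astronomical Low Tide Z", "Avalanche Z", "Tornado C")

# Lower-case the names, then drop the last two characters
# (the separating space and the designator letter).
cleaned <- tolower(raw_events)
cleaned <- substr(cleaned, 1, nchar(cleaned) - 2)
cleaned
```

Dropping a fixed two characters only works if every line ends in exactly one space and one letter, so the raw file is worth checking for trailing whitespace first.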

An events vector is created by splitting the event names in event_list into individual keywords for easy searching. An events_value vector is then created that holds, for each observation, the keywords from events that are found in its remarks column.

        library(glue)
## Warning: package 'glue' was built under R version 3.6.3
        events <- unique(unlist(strsplit(trim(event_list$event)," ")))
        events <- trim(gsub("[()]"," ",events))
        event_find <- sapply(events, grepl, work_data$remarks, ignore.case=TRUE)
        events_value <- apply(event_find, 1, function(u) paste(names(which(u)), collapse=","))
        work_data$events_value <- events_value
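
The keyword matching above can be illustrated on toy data. The sketch below uses three made-up remarks and three keywords; `toy_events`, `toy_remarks`, `hits` and `matched` are illustrative names only.

```r
toy_events  <- c("flood", "tornado", "hail")
toy_remarks <- c("A tornado touched down near the river.",
                 "Flash flood with large hail.",
                 "")

# One column per keyword, one row per remark; TRUE where the keyword occurs.
hits <- sapply(toy_events, grepl, toy_remarks, ignore.case = TRUE)

# Collapse each row into a comma-separated string of the matching keywords.
matched <- apply(hits, 1, function(u) paste(names(which(u)), collapse = ","))
matched
```

A remark with no matching keyword yields an empty string, which is also what appears in several of the events_value listings later in the report.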

Recoding/Regrouping Data

We now recode misspelt or similar event types in work_data and group them according to event_list. All items that cannot be grouped under an entry in event_list are grouped as “others”.

        work_data$evtype[grep("winter weather",work_data$evtype)] <- "winter weather"
        work_data$evtype[grep("(ry|er)\\s(mix)$",work_data$evtype)] <- "others"
        work_data$evtype[grep("fire",work_data$evtype)] <- "wildfire"
        work_data$evtype[grep("^(thun|tun)",work_data$evtype)] <- "thunderstorm wind"
        work_data$evtype[grep("^(\\ststm|tstm)",work_data$evtype)] <- "thunderstorm wind" #tstm is an abbreviation used for thunderstorm
        
        
        work_data$evtype[grep("urban",work_data$evtype)] <- "flood"
        work_data$evtype[grep("tropical storm",work_data$evtype)] <- "tropical storm"
        work_data$evtype[grep("^extreme\\s(c|w)",work_data$evtype)] <- "extreme cold/wind chill"
        work_data$evtype[grep("astro",work_data$evtype)] <- "astronomical low tide"
        work_data$evtype[grep("flash",work_data$evtype)] <- "flash flood"
        
        
        work_data$evtype[grep("stream",work_data$evtype)] <- "flood"
        work_data$evtype[grep("spout",work_data$evtype)] <- "waterspout"
        work_data$evtype[grep("torn",work_data$evtype)] <- "tornado"
        work_data$evtype[grep("bliz",work_data$evtype)] <- "blizzard"
        work_data$evtype[grep("heavy\\ssnow",work_data$evtype)] <- "heavy snow"
        work_data$evtype[grep("^(snow)",work_data$evtype)] <- "heavy snow"
        
       
        work_data$evtype[grep("^(hail)",work_data$evtype)] <- "hail"
        work_data$evtype[grep("aval",work_data$evtype)] <- "avalanche"
        work_data$evtype[grep("^high\\swin",work_data$evtype)] <- "high wind"
        work_data$evtype[grep("^heavy\\srain",work_data$evtype)] <- "heavy rain"
        work_data$evtype[grep("drought",work_data$evtype)] <- "drought"
        
        work_data$evtype[grep("^heat",work_data$evtype)] <- "heat"
        work_data$evtype[grep("^(extreme|(record|record/excessive)|excessive)\\sheat",work_data$evtype)] <- "excessive heat"
        work_data$evtype[grep("rip",work_data$evtype)] <- "rip current"
        work_data$evtype[grep("dense\\sfog",work_data$evtype)] <- "dense fog"
        work_data$evtype[grep("^fog",work_data$evtype)] <- "dense fog"
        
        work_data$evtype[grep("ce\\sfog",work_data$evtype)] <- "freezing fog"
        work_data$evtype[grep("sleet",work_data$evtype)] <- "sleet"
        work_data$evtype[grep("hurri",work_data$evtype)] <- "hurricane(typhoon)"
        work_data$evtype[grep("ice\\sstorm",work_data$evtype)] <- "ice storm"
        work_data$evtype[grep("warmth",work_data$evtype)] <- "excessive heat"
        
        work_data$evtype[grep("warm",work_data$evtype)] <- "heat"
        work_data$evtype[grep("cloud",work_data$evtype)] <- "funnel cloud"
        work_data$evtype[grep("^cold",work_data$evtype)] <- "cold/wind chill"
        work_data$evtype[grep("\\scold",work_data$evtype)] <- "extreme cold/wind chill"
        work_data$evtype[grep("^strong|[/]strong",work_data$evtype)] <- "strong wind"
        
        work_data$evtype[grep("record\\sdry",work_data$evtype)] <- "excessive heat"
        work_data$evtype[grep("^dry",work_data$evtype)] <- "heat"
        work_data$evtype[grep("(coastal|cstl|coastal/tidal)\\sflood",work_data$evtype)] <- "coastal flood"
        work_data$evtype[grep("^flood|((ay|er|et)\\sflood)",work_data$evtype)] <- "flood"
        work_data$evtype[grep("marine\\ststm\\swind",work_data$evtype)] <- "marine thunderstorm wind"
        
        work_data$evtype[grep("thuder",work_data$evtype)] <- "thunderstorm wind"
        work_data$evtype[grep("(nt|ty|st)\\swind",work_data$evtype)] <- "high wind"
        work_data$evtype[grep("^wind\\schill",work_data$evtype)] <- "cold/wind chill"
        work_data$evtype[grep("^wind",work_data$evtype)] <- "high wind"
        work_data$evtype[grep("surf",work_data$evtype)] <- "high surf"
        
        work_data$evtype[grep("wet|rain|preci",work_data$evtype)] <- "heavy rain"
        work_data$evtype[grep("volc",work_data$evtype)] <- "volcanic ash"
        work_data$evtype[grep("record\\stemp",work_data$evtype)] <- "excessive heat"
        work_data$evtype[grep("^temp|high\\stemp",work_data$evtype)] <- "excessive heat"
        work_data$evtype[grep("^record\\s(co|lo|ma|sn|wi)",work_data$evtype)] <- "extreme cold/wind chill"
        
        work_data$evtype[grep("cool",work_data$evtype)] <- "cold/wind chill"
        work_data$evtype[grep("slide",work_data$evtype)] <- "debris flow" # see document for  change in classification 
        work_data$evtype[grep("ll\\shail",work_data$evtype)] <- "hail" 
        work_data$evtype[grep("hypo|low\\stemp",work_data$evtype)] <- "extreme cold/wind chill" 
        work_data$evtype[grep("seas$",work_data$evtype)] <- "high surf" 
        
        work_data$evtype[grep("\\sfree",work_data$evtype)] <- "frost/freeze"
        work_data$evtype[grep("freezing\\s(dr|sp)",work_data$evtype)] <- "frost/freeze"
        work_data$evtype[grep("freeze",work_data$evtype)] <- "frost/freeze"
        
        work_data$evtype[grep("[?]|beach|driest|first|glaze|gustnado|mix|high\\swater|icy\\sroad|light\\ssnow|monthly|none|northern|other|south",work_data$evtype)] <- "others" 
        work_data$evtype[grep("dust\\sde",work_data$evtype)] <- "dust devil" 
        work_data$evtype[grep("blowing\\ssnow",work_data$evtype)] <- "heavy snow" 
        work_data$evtype[grep("frost",work_data$evtype)] <- "frost/freeze" 
        
        work_data$evtype[grep("funnel",work_data$evtype)] <- "funnel cloud" 
        work_data$evtype[grep("^light(n|i)",work_data$evtype)] <- "lightning"
        work_data$evtype[grep("winter\\sstorm",work_data$evtype)] <- "winter storm"
        work_data$evtype[grep("light",work_data$evtype)] <- "lightning"
        work_data$evtype[grep("coast",work_data$evtype)] <- "coastal flood"
        work_data$evtype[grep("bitter",work_data$evtype)] <- "extreme cold/wind chill"
        
        
        work_data$evtype[grep("^gusty\\sthu",work_data$evtype)] <- "thunderstorm wind"
        work_data$evtype[grep("tstm",work_data$evtype)] <- "high wind"
        work_data$evtype[grep("severe\\sthu",work_data$evtype)] <- "thunderstorm wind"
        work_data$evtype[grep("^high\\s(\\s|(s|wa|ti))",work_data$evtype)] <- "high surf"
        work_data$evtype[grep("\\sdust",work_data$evtype)] <- "dust storm"
        
        work_data$evtype[grep("dust(\\s|st)[^d]",work_data$evtype)] <- "dust storm"
        work_data$evtype[grep("^low\\swi",work_data$evtype)] <- "cold/wind chill"
        work_data$evtype[grep("(up|cal|am|or|ral)\\sflood",work_data$evtype)] <- "flood"
        work_data$evtype[grep("e\\sflood",work_data$evtype)] <- "lakeshore flood"
        work_data$evtype[grep("dal\\sflood",work_data$evtype)] <- "coastal flood"
        
        work_data$evtype[grep("uns",work_data$evtype)] <- "heat"
        work_data$evtype[grep("(ep|on|re)\\shail",work_data$evtype)] <- "hail"
        work_data$evtype[grep("lake(\\s|[-])(ef|sn)",work_data$evtype)] <- "lake-effect snow" 
        work_data$evtype[grep("(ed|ve|rd)\\ssnow",work_data$evtype)] <- "heavy snow" 
        work_data$evtype[grep("(ly|ry)\\sdry",work_data$evtype)] <- "excessive heat"
        
        work_data$evtype[grep("storm\\ssurge",work_data$evtype)] <-  "storm surge/tide"
        work_data$evtype[grep("(^storm\\sfo)|(^metro)",work_data$evtype)] <-  "tropical storm"
        work_data$evtype[grep("lig",work_data$evtype)] <-  "lightning"
        work_data$evtype[grep("heavy\\ssh",work_data$evtype)] <-  "heavy rain"
        work_data$evtype[grep("heavy\\ssw",work_data$evtype)] <-  "heavy surf"
        
        work_data$evtype[grep("hot",work_data$evtype)] <-  "heat"
        work_data$evtype[grep("surf",work_data$evtype)] <-  "high surf"
        work_data$evtype[grep("wn|wak|vog|unu|tur|season|rog|remn|rec|rap|patc|non|mod|mil|micr|late|land|gusty|fall|ear|drif|down|blow",work_data$evtype)] <-  "others"
        work_data$evtype[grep("ice\\s[^(st)]",work_data$evtype)] <-  "others"
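
Because each rule overwrites evtype in place, later patterns see the result of earlier ones, so the order of the rules matters. The sketch below expresses the same idea as an ordered rule table applied in a loop. `recode_by_rules` is a hypothetical helper and only four of the patterns from the report are included, so this is an illustration of the technique rather than a replacement for the full list.

```r
# Hypothetical helper: apply pattern -> label rules one after another.
# `rules` is a named character vector: names are regex patterns,
# values are the labels assigned to every element that matches.
recode_by_rules <- function(x, rules) {
  for (i in seq_along(rules)) {
    x[grepl(names(rules)[i], x)] <- rules[[i]]
  }
  x
}

# A small subset of the rules used above, in the same order.
rules <- c(
  "fire"            = "wildfire",
  "^(thun|tun)"     = "thunderstorm wind",
  "^(\\ststm|tstm)" = "thunderstorm wind",
  "flash"           = "flash flood"
)

toy_evtype <- c("tstm wind", "thunderstorm winds", "flash flooding", "wild fires")
recoded <- recode_by_rules(toy_evtype, rules)
recoded
```

Keeping the rules in one table makes the order explicit and easier to review than a long run of separate assignments.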

We now work on cases where evtype is “summary…”. We find that the majority have the keywords thunderstorm or tornado in the remarks column, and we assign them to the respective groups. Where both are present, we group the case under tornado. The one remaining case was grouped under “flash flood” as it contains both keywords “flash” and “flood”.

         summ_list <- grep("summ",work_data$evtype)
         summ_tor <- grep("tornado",work_data$events_value[summ_list])
         work_data$evtype[summ_list[summ_tor]] <- "tornado"
         summ_list <- grep("summ",work_data$evtype)
         summ_thun <- grep("thun",work_data$events_value[summ_list])
         work_data$evtype[summ_list[summ_thun]] <- "thunderstorm wind"
         summ_list <- grep("summ",work_data$evtype)
         work_data$events_value[summ_list]
## [1] "low,flood,storm,flash,heavy,rain,ash"
         work_data$evtype[summ_list] <- "flash flood"

We now examine a few cases where the evtype alone was not helpful in deciding the group, and refer to events_value to decide on the right grouping.

          work_data$events_value[work_data$evtype == " wind"]
## [1] "low,wind,strong"
          work_data$evtype[grep("^\\swind",work_data$evtype)] <-  "strong wind"
         
          work_data$events_value[work_data$evtype == "apache county"]
## [1] "storm,wind,thunderstorm"
          work_data$evtype[grep("apache",work_data$evtype)] <-  "thunderstorm wind"
          
         work_data$events_value[work_data$evtype == "black ice"]
##  [1] "excessive,freezing,ice"            "freezing,ice"                     
##  [3] "low,flow,surf,ice"                 "low,flow,surf,ice"                
##  [5] "ice"                               "ice"                              
##  [7] "low,flow,surf,ice,ash"             "rain,ice,ash"                     
##  [9] "low,freezing,rain,snow,ice,strong" "rain,ice,ash"                     
## [11] "low,fog,freezing,wind,ice"         "dense,fog,freezing,ice"           
## [13] "ice"                               "ice"                              
## [15] "low,freezing,heavy,ice"            "ice"                              
## [17] "ice"
         work_data$evtype[grep("black",work_data$evtype)] <-  "frost/freeze"
         
         work_data$evtype[grep("dam",work_data$evtype)] <-  "flood"
         
         work_data$events_value[work_data$evtype == "excessive"]
## [1] "rain,weather"
         work_data$evtype[work_data$evtype == "excessive"] <-  "heavy rain"
         
         work_data$events_value[work_data$evtype == "high"]
## [1] "high,wind"
         work_data$evtype[work_data$evtype == "high"] <-  "high wind"
         
         work_data$events_value[work_data$evtype == "hyperthermia/exposure"]
## [1] "low"
         work_data$evtype[work_data$evtype == "hyperthermia/exposure"] <-  "extreme cold/wind chill"
         
         work_data$events_value[work_data$evtype == "ice"]
##  [1] "ice"                                           
##  [2] "rain,ice"                                      
##  [3] "low,freezing,rain,ice,sleet"                   
##  [4] "low,heavy,high,wind,ice"                       
##  [5] "low,heavy,high,wind,ice"                       
##  [6] "low,heavy,high,wind,ice"                       
##  [7] "low,heavy,high,wind,ice"                       
##  [8] "low,heavy,high,wind,ice"                       
##  [9] "low,heavy,high,wind,ice"                       
## [10] "low,heavy,high,wind,ice"                       
## [11] "low,heavy,high,wind,ice"                       
## [12] "low,heavy,high,wind,ice"                       
## [13] "low,heavy,high,wind,ice"                       
## [14] ""                                              
## [15] ""                                              
## [16] ""                                              
## [17] ""                                              
## [18] "freezing,rain,ice"                             
## [19] "freezing,rain,sleet"                           
## [20] ""                                              
## [21] ""                                              
## [22] "low,storm,freezing,rain,surf,ice,sleet"        
## [23] "freezing,rain,sleet"                           
## [24] "freezing,rain,high,ice,sleet"                  
## [25] "low,freezing,ice"                              
## [26] "freezing,ice"                                  
## [27] "freezing,rain,sleet"                           
## [28] "freezing,rain,ice"                             
## [29] "rain,ice"                                      
## [30] "ice,strong"                                    
## [31] "low,storm,freezing,heavy,rain,ice,sleet,winter"
## [32] "ice"                                           
## [33] "ice"                                           
## [34] "freezing,rain,snow,ice"                        
## [35] "flood,freezing,rain,ice"                       
## [36] "fog,freezing"                                  
## [37] "freezing,rain,sleet"                           
## [38] "low,storm,freezing,rain,surf,ice,sleet"        
## [39] "storm,freezing,rain,ice,strong,sleet"          
## [40] "freezing,rain,high,sleet"                      
## [41] "freezing,rain,ice"                             
## [42] "freezing,rain,ice"                             
## [43] ""                                              
## [44] ""                                              
## [45] ""                                              
## [46] ""                                              
## [47] ""                                              
## [48] ""                                              
## [49] ""                                              
## [50] ""                                              
## [51] "freezing,rain,snow,sleet"                      
## [52] "high"                                          
## [53] "high"                                          
## [54] "fog,high"                                      
## [55] "fog,ice"                                       
## [56] "ice"                                           
## [57] "high,ice"                                      
## [58] ""                                              
## [59] "freezing,rain,ice"                             
## [60] "ice"                                           
## [61] "freezing,rain,wind,ice"
         work_data$evtype[work_data$evtype == "ice"] <- "frost/freeze"
         
         work_data$events_value[work_data$evtype == "ice/snow"]
## [1] "low,freezing,rain,snow,ice,sleet"                             
## [2] "low,freezing,rain,snow,sleet"                                 
## [3] "low,freezing,rain,snow,high,sleet,ash"                        
## [4] "storm,extreme,freezing,rain,snow,high,wind,ice,winter,weather"
## [5] "extreme,freezing,rain,snow,wind,ice,winter"
         work_data$evtype[work_data$evtype == "ice/snow"] <- "frost/freeze"
         
         work_data$events_value[work_data$evtype == "lack of snow"]
## [1] "snow"
         work_data$evtype[work_data$evtype == "lack of snow"] <- "others"
         
         work_data$events_value[work_data$evtype == "marine accident"]
## [1] "heavy"
         work_data$evtype[work_data$evtype == "marine accident"] <- "high surf"
         
         work_data$events_value[work_data$evtype == "marine mishap"]
## [1] "chill,wind,marine" "fog,heavy,wind"
         work_data$evtype[work_data$evtype == "marine mishap"] <- "high surf"
         
         work_data$events_value[work_data$evtype == "mountain snows"]
## [1] "storm,snow,high"
         work_data$evtype[work_data$evtype == "mountain snows"] <- "ice storm"
         
         work_data$events_value[work_data$evtype == "no severe weather"]
## [1] "weather"
         work_data$evtype[work_data$evtype == "no severe weather"] <- "others"
         
          work_data$events_value[work_data$evtype == "red flag criteria"]
## [1] "rain,high,lightning,weather" "lightning,weather"
          work_data$evtype[work_data$evtype == "red flag criteria"] <- "lightning"
          
          work_data$events_value[work_data$evtype == "smoke"]
##  [1] "smoke,wildfire"                                                   
##  [2] "smoke,wildfire"                                                   
##  [3] "low,dense,smoke,wildfire"                                         
##  [4] "low,flow,smoke,wind"                                              
##  [5] "low,smoke,storm,high,surf,wind,ice,lightning,thunderstorm,weather"
##  [6] "low,flow,smoke,wind"                                              
##  [7] "low,smoke,storm,high,surf,wind,ice,lightning,thunderstorm,weather"
##  [8] "low,flow,smoke,wind"                                              
##  [9] "low,smoke,storm,high,surf,wind,ice,lightning,thunderstorm,weather"
## [10] "low,flow,smoke,wind"                                              
## [11] "low,smoke,storm,high,surf,wind,ice,lightning,thunderstorm,weather"
          work_data$evtype[work_data$evtype == "smoke"] <- "wildfire"
          
          work_data$events_value[work_data$evtype == "typhoon"]
##  [1] "storm,heavy,rain,wind,strong,tropical"                                                             
##  [2] "storm,tropical,depression"                                                                         
##  [3] "storm, typhoon ,ice,tropical,depression,waterspout,weather"                                        
##  [4] "low,flood,storm,heavy,rain,high,wind, typhoon ,ice,strong,thunderstorm,tropical,depression,weather"
##  [5] "storm,wind, typhoon ,ice,tropical,depression,weather"                                              
##  [6] "low,flood,flow,storm,rain,high,wind, typhoon ,ice,tropical,weather"                                
##  [7] "low,flood,flow,storm,rain,high,wind, typhoon ,ice,tropical,weather"                                
##  [8] "flood,debris,storm,rain,surf,wind, typhoon ,strong"                                                
##  [9] "low,storm,heavy,rain,high,wind, typhoon ,ice,tropical,weather"                                     
## [10] "low,storm,heavy,rain,high,wind, typhoon ,ice,tropical,weather"                                     
## [11] "low,storm,heavy,rain,high,wind, typhoon ,ice,tropical,weather"
          work_data$evtype[work_data$evtype == "typhoon"] <- "hurricane(typhoon)"
          
          work_data$events_value[work_data$evtype == "whirlwind"]
## [1] "dust,devil"                 "dust,devil,wind,strong,rip"
## [3] "wind,strong"
         work_data$evtype[work_data$evtype == "whirlwind"] <- "dust devil"
         sort(unique(work_data$evtype))
##  [1] "astronomical low tide"    "avalanche"               
##  [3] "blizzard"                 "coastal flood"           
##  [5] "cold/wind chill"          "debris flow"             
##  [7] "dense fog"                "dense smoke"             
##  [9] "drought"                  "dust devil"              
## [11] "dust storm"               "excessive heat"          
## [13] "extreme cold/wind chill"  "flash flood"             
## [15] "flood"                    "freezing fog"            
## [17] "frost/freeze"             "funnel cloud"            
## [19] "hail"                     "heat"                    
## [21] "heavy rain"               "heavy snow"              
## [23] "high surf"                "high wind"               
## [25] "hurricane(typhoon)"       "ice storm"               
## [27] "lake-effect snow"         "lakeshore flood"         
## [29] "lightning"                "marine hail"             
## [31] "marine high wind"         "marine strong wind"      
## [33] "marine thunderstorm wind" "others"                  
## [35] "rip current"              "seiche"                  
## [37] "sleet"                    "storm surge/tide"        
## [39] "strong wind"              "thunderstorm wind"       
## [41] "tornado"                  "tropical depression"     
## [43] "tropical storm"           "tsunami"                 
## [45] "volcanic ash"             "waterspout"              
## [47] "wildfire"                 "winter storm"            
## [49] "winter weather"

We end up with 49 unique evtype values. This is one more than the documented list because of the added evtype “others”. The percentage of observations classified as “others” is negligible at 0.07%.
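
The code behind these two checks is not shown in the report. A minimal sketch of how they could be computed follows, using toy vectors in place of event_list$event and work_data$evtype; all names in the block are illustrative.

```r
# Stand-ins for the official list and the recoded event types.
official   <- c("tornado", "flood", "hail")
toy_evtype <- c("tornado", "flood", "others", "hail", "tornado", "glaze")

# Event types that are neither official nor "others" still need recoding.
leftover <- setdiff(unique(toy_evtype), c(official, "others"))

# Percentage of observations that ended up in the catch-all group.
share_others <- 100 * mean(toy_evtype == "others")

leftover
share_others
```

An empty `leftover` would confirm that every observation maps to a documented event type or to “others”.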

Now we look at the values in the columns “fatalities” and “injuries”. There are 0 NAs in the “fatalities” field and 0 NAs in the “injuries” field.

We also look at summary statistics for those fields and examine how many observations have non-zero “fatalities” or “injuries”.

summary(work_data$fatalities)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(work_data$injuries)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000
sum(work_data$fatalities > 0)
## [1] 6974
sum(work_data$injuries > 0)
## [1] 17604
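
The NA counts quoted above are stated without code. One way they could be obtained is sketched below on a toy data frame, which deliberately contains one NA to show the effect; `toy`, `na_counts` and `nonzero_counts` are made-up names.

```r
toy <- data.frame(fatalities = c(0, 1, NA, 0),
                  injuries   = c(15, 0, 2, 0))

# Number of missing values per column.
na_counts <- colSums(is.na(toy))

# Number of observations with a non-zero value per column, skipping NAs.
nonzero_counts <- colSums(toy > 0, na.rm = TRUE)

na_counts
nonzero_counts
```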

Results - 1

Now we are ready to answer the first question

1. Across the United States, which types of events are most harmful with respect to population health?

To answer this question we pick the top five events by total number of injuries.

        library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:glue':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
        library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
        inj_data <- subset(work_data,subset = work_data$injuries > 0 ,select = c("evtype","injuries"))
        
        result1 <- inj_data %>% group_by(evtype) %>% summarise(injuries = sum(injuries)) %>% arrange(desc(injuries)) %>% top_n(n=5,wt = injuries)
        kable(result1)
|evtype            | injuries|
|:-----------------|--------:|
|tornado           |    91364|
|thunderstorm wind |     9509|
|flood             |     6873|
|excessive heat    |     6730|
|lightning         |     5232|
        library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
        result1$evtype <- factor(result1$evtype, levels=c("tornado", "thunderstorm wind", "flood","excessive heat","lightning"))
        ggplot(data = result1,aes(evtype,injuries/1000)) + geom_col(color = "red")+ labs(x = "events",y = "Injured ('000)",title = "Top five events - Injuries")
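
For readers without dplyr, the group-sum-and-rank step above can be sketched in base R. The toy data and the names `toy`, `totals` and `top2` are illustrative, and this is an equivalent formulation rather than the code used for the report.

```r
toy <- data.frame(
  evtype   = c("tornado", "flood", "tornado", "hail", "flood", "lightning"),
  injuries = c(15, 3, 2, 1, 4, 6)
)

# Sum injuries per event type, then sort from largest to smallest total.
totals <- aggregate(injuries ~ evtype, data = toy, FUN = sum)
totals <- totals[order(-totals$injuries), ]

# Keep the two event types with the highest totals.
top2 <- head(totals, 2)
top2
```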

Further analysis to answer the second question

To answer the question on events with the greatest economic consequences, we need to look at propdmg and propdmgexp for the amount of property damage, and at cropdmg and cropdmgexp for the amount of crop damage. Injuries and deaths also have economic consequences, but these are not considered here.

The field propdmg has 0 NAs and the field cropdmg has 0 NAs. The summary statistics of propdmg and cropdmg are examined.

        summary(work_data$propdmg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
        summary(work_data$cropdmg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000

The values of propdmg grouped by propdmgexp are examined.

        work_data %>% select(propdmg,propdmgexp) %>% group_by(propdmgexp) %>% summarise(count = n(),sum = sum(propdmg))
## # A tibble: 19 x 3
##    propdmgexp  count        sum
##    <fct>       <int>      <dbl>
##  1 ""         465934      527. 
##  2 "-"             1       15  
##  3 "?"             8        0  
##  4 "+"             5      117  
##  5 "0"           216     7108. 
##  6 "1"            25        0  
##  7 "2"            13       12  
##  8 "3"             4       20  
##  9 "4"             4       14.5
## 10 "5"            28      210. 
## 11 "6"             4       65  
## 12 "7"             5       82  
## 13 "8"             1        0  
## 14 "B"            40      276. 
## 15 "h"             1        2  
## 16 "H"             6       25  
## 17 "K"        424665 10735292. 
## 18 "m"             7       38.9
## 19 "M"         11330   140694.

Looking at the above table, only about 300 of the 902,297 observations have a numeric exponent code, so these can be ignored. We set the exponent to zero for all cases other than B, b, M, m, K, k. For cases with B, b, M, m, K, k we use the respective powers of ten for easy calculation:
B,b will be 9
M,m will be 6
K,k will be 3
We then create a column “propcost” = propdmg * (10^propdmgexp) indicating the total cost due to property damage.

        work_data$propdmgexp <- tolower(as.character(work_data$propdmgexp))
        work_data$propdmgexp[grep("[0-9]|[+?]",work_data$propdmgexp)] <- "0"
        work_data$propdmgexp[grep("[-h]",work_data$propdmgexp)] <- "0"
        work_data$propdmgexp[grep("k",work_data$propdmgexp)] <- "3"
        work_data$propdmgexp[grep("m",work_data$propdmgexp)] <- "6"
        work_data$propdmgexp[grep("b",work_data$propdmgexp)] <- "9"
        
        work_data$propdmgexp <- as.integer(work_data$propdmgexp)
        work_data$propdmgexp[is.na(work_data$propdmgexp)] <- 0
        work_data <- work_data %>% mutate(propcost = propdmg * (10^propdmgexp)) 
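
The same exponent mapping can be written as a lookup table, which may be easier to audit than a sequence of grep replacements. `damage_cost` is a hypothetical helper introduced for this sketch. It follows the rule stated above: k, m and b map to 3, 6 and 9, and every other code (including blanks, digits, symbols and h) gets an exponent of 0.

```r
# Hypothetical helper: convert a damage amount and its exponent code to USD.
damage_cost <- function(dmg, exp_code) {
  powers <- c(k = 3, m = 6, b = 9)
  # Look up the power of ten; codes not in the table come back as NA.
  exponent <- unname(powers[tolower(as.character(exp_code))])
  # Any unrecognised code is treated as exponent 0 (a multiplier of 1).
  exponent[is.na(exponent)] <- 0
  dmg * 10^exponent
}

costs <- damage_cost(c(25, 2.5, 1.2, 10, 7), c("K", "m", "B", "", "5"))
costs
```

For example, a propdmg of 25 with exponent code K corresponds to 25 * 10^3 = 25,000 USD.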

Now we examine the values for cropdmg grouped by cropdmgexp

        work_data %>% select(cropdmg,cropdmgexp) %>% group_by(cropdmgexp) %>% summarise(count = n(),sum = sum(cropdmg))
## # A tibble: 9 x 3
##   cropdmgexp  count       sum
##   <fct>       <int>     <dbl>
## 1 ""         618413      11  
## 2 "?"             7       0  
## 3 "0"            19     260  
## 4 "2"             1       0  
## 5 "B"             9      13.6
## 6 "k"            21     436  
## 7 "K"        281832 1342956. 
## 8 "m"             1      10  
## 9 "M"          1994   34141.

Looking at the above table, only 20 observations have a numeric exponent code, so these can be ignored. We set the exponent to zero for all cases other than B, b, M, m, K, k. For cases with B, b, M, m, K, k we use the respective powers of ten for easy calculation:
B,b will be 9
M,m will be 6
K,k will be 3
We then create a column “cropcost” = cropdmg * (10^cropdmgexp) indicating the total cost due to crop damage.

        work_data$cropdmgexp <- tolower(as.character(work_data$cropdmgexp))
        work_data$cropdmgexp[grep("[0-9]|[?]",work_data$cropdmgexp)] <- "0"
        
        work_data$cropdmgexp[grep("k",work_data$cropdmgexp)] <- "3"
        work_data$cropdmgexp[grep("m",work_data$cropdmgexp)] <- "6"
        work_data$cropdmgexp[grep("b",work_data$cropdmgexp)] <- "9"
        
        work_data$cropdmgexp <- as.integer(work_data$cropdmgexp)
        work_data$cropdmgexp[is.na(work_data$cropdmgexp)] <- 0
        work_data <- work_data %>% mutate(cropcost = cropdmg * (10^cropdmgexp)) 

Results - 2

Now we combine propcost and cropcost into ecocost, which we use to answer the second question:
2. Across the United States, which types of events have the greatest economic consequences?
We pick the top five events by total damage cost.

        work_data <- work_data %>% mutate(ecocost = propcost + cropcost)

        cost_data <- subset(work_data,subset = work_data$ecocost > 0 ,select = c("evtype","ecocost"))
        
        result2 <- cost_data %>% group_by(evtype) %>% summarise(ecocost_USD_Billion = sum(ecocost)/(10^9)) %>% arrange(desc(ecocost_USD_Billion)) %>% top_n(n=5,wt = ecocost_USD_Billion)
        kable(result2)
|evtype             | ecocost_USD_Billion|
|:------------------|-------------------:|
|flood              |           161.06865|
|hurricane(typhoon) |            90.87253|
|tornado            |            58.95939|
|storm surge/tide   |            47.96558|
|hail               |            19.02143|
        result2$evtype <- factor(result2$evtype, levels=c("flood","hurricane(typhoon)","tornado", "storm surge/tide", "hail"))
        ggplot(data = result2,aes(evtype,ecocost_USD_Billion)) + geom_col(color = "blue")+ labs(x = "events",y = "Total damage (B USD)",title = "Top five events - Damage")
