This is the second project for the Reproducible Research course, which is part of Coursera’s Data Science Specialization and the final course of Coursera’s Data Science: Foundations Using R Specialization.
The report aims to find out which types of severe weather events are most harmful to population health and which have the most significant economic consequences. It can be used by government bodies to prepare for such events, prioritise resources, and prevent destructive outcomes to the extent possible.
To investigate this issue, I’ve explored the NOAA storm database, whose records come from the National Weather Service as well as the media, law enforcement, other government agencies, private companies, and individuals. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, along with estimates of any fatalities/ injuries and property/ crop damage.
The main conclusion is that across the U.S., tornadoes are the events most harmful to population health, while hurricanes (typhoons) have the most significant economic consequences.
The original data was obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA site).
The events in the database for the assignment start in the year \(1950\) and end in November \(2011\). However, as data collection procedures have changed, the unique event types for which data was collected have varied over the years, and more event types were added over time. More information on the database (e.g. descriptions of event types and data set variables, as well as how the variables are constructed) can be found at:
The raw data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. So, it needs to be downloaded, decompressed, and then read into the storm data.table.
Also, to sort out some issues, a “test set” (stormTest) was assembled from the updated data for the years \(1993-1995,\ 2001,\ 2004-2006\). The revised versions for these years are available on the Bulk Data Download page (choose ’Access: HTTP’ there). Download and read it all:
library(data.table); library(dplyr); library(tidyr); library(quantmod)
library(ggplot2); library(gridExtra); library(plotly)
library(DT); library(splitstackshape)
loader()
storm <- fread("./data/1119_DS-RR-w4_Storm/storm.csv")
stormTest<- lapply(list("./data/1119_DS-RR-w4_Storm/storm1993.csv",
"./data/1119_DS-RR-w4_Storm/storm1994.csv",
"./data/1119_DS-RR-w4_Storm/storm1995.csv",
"./data/1119_DS-RR-w4_Storm/storm2001.csv",
"./data/1119_DS-RR-w4_Storm/storm2004.csv",
"./data/1119_DS-RR-w4_Storm/storm2005.csv",
"./data/1119_DS-RR-w4_Storm/storm2006.csv"), fread)
stormTest <- rbindlist(stormTest)
Examine the structure of the storm data and the column names of stormTest:
str(storm)
Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
$ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : chr "" "" "" "" ...
$ BGN_LOCATI: chr "" "" "" "" ...
$ END_DATE : chr "" "" "" "" ...
$ END_TIME : chr "" "" "" "" ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : chr "" "" "" "" ...
$ END_LOCATI: chr "" "" "" "" ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
$ WFO : chr "" "" "" "" ...
$ STATEOFFIC: chr "" "" "" "" ...
$ ZONENAMES : chr "" "" "" "" ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : chr "" "" "" "" ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
names(stormTest) [1] "BEGIN_YEARMONTH" "BEGIN_DAY" "BEGIN_TIME"
[4] "END_YEARMONTH" "END_DAY" "END_TIME"
[7] "EPISODE_ID" "EVENT_ID" "STATE"
[10] "STATE_FIPS" "YEAR" "MONTH_NAME"
[13] "EVENT_TYPE" "CZ_TYPE" "CZ_FIPS"
[16] "CZ_NAME" "WFO" "BEGIN_DATE_TIME"
[19] "CZ_TIMEZONE" "END_DATE_TIME" "INJURIES_DIRECT"
[22] "INJURIES_INDIRECT" "DEATHS_DIRECT" "DEATHS_INDIRECT"
[25] "DAMAGE_PROPERTY" "DAMAGE_CROPS" "SOURCE"
[28] "MAGNITUDE" "MAGNITUDE_TYPE" "FLOOD_CAUSE"
[31] "CATEGORY" "TOR_F_SCALE" "TOR_LENGTH"
[34] "TOR_WIDTH" "TOR_OTHER_WFO" "TOR_OTHER_CZ_STATE"
[37] "TOR_OTHER_CZ_FIPS" "TOR_OTHER_CZ_NAME" "BEGIN_RANGE"
[40] "BEGIN_AZIMUTH" "BEGIN_LOCATION" "END_RANGE"
[43] "END_AZIMUTH" "END_LOCATION" "BEGIN_LAT"
[46] "BEGIN_LON" "END_LAT" "END_LON"
[49] "EPISODE_NARRATIVE" "EVENT_NARRATIVE" "DATA_SOURCE"
rm(loader)
There are \(902\ 297\) records of \(37\) variables in the raw storm data, and \(51\) variables in the stormTest set.
First, deal with duplicates.
duo<- select(storm, -c(REFNUM,REMARKS)) %>% duplicated()
storm <- storm[!duo,]
duostorm<-sum(duo) # 3954
duo<- select(stormTest, -c(EVENT_ID,EVENT_NARRATIVE)) %>% duplicated()
stormTest <- stormTest[!duo,]
duotest<-sum(duo) # 426
Disregarding the id and narrative variables (REFNUM/ REMARKS and EVENT_ID/ EVENT_NARRATIVE), the storm data contains 3954 duplicates and the stormTest set 426. The duplicated observations were removed.
Then, subset the storm data set to the parameters of interest. After exploring its structure, select the following main variables for answering the questions:
- FATALITIES/ INJURIES: deaths and injuries estimated for the event,
- PROPDMG/ CROPDMG: cost of property (buildings, power lines, vehicles)/ crops (crop, grain bins, cows killed) damage,
- PROPDMGEXP/ CROPDMGEXP: the monetary units used to interpret the numeric values of property/ crops damage,
- EVTYPE: weather event type (TORNADO, THUNDERSTORM WIND, EXCESSIVE HEAT, DROUGHT, etc.).
Some helper variables will be needed too, viz: REFNUM (id-variable), BGN_DATE (the date the event/ its public impact begins), STATE, COUNTY (event location, county forecast zone), REMARKS (event narrative).
(Corresponding stormTest variables:
DEATHS_DIRECT+DEATHS_INDIRECT, INJURIES_DIRECT+INJURIES_INDIRECT, DAMAGE_PROPERTY, DAMAGE_CROPS (final values, w/o multipliers), EVENT_TYPE;
EVENT_ID, BEGIN_DATE_TIME, STATE, CZ_FIPS, EVENT_NARRATIVE.)
Extract all of the above variables from both storm and stormTest, and also:
- match up the differently named counterparts (REFNUM/ EVENT_ID and DAMAGE_PROPERTY/ PROPDMG, DAMAGE_CROPS/ CROPDMG),
- convert the BGN_DATE variable to IDate class,
- create a YEAR variable.
storm <- storm[,BGN_DATE:= as.IDate(BGN_DATE, "%m/%d/%Y %H:%M:%S")][
,.(REFNUM, YEAR= year(BGN_DATE), BGN_DATE, EVTYPE, STATE,C_ZONE= COUNTY,
DEATHS=FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG,
CROPDMGEXP, REMARKS)]
stormTest<-stormTest[,BGN_DATE:=as.IDate(BEGIN_DATE_TIME,"%d-%b-%y %H:%M:%S")][
,.(EVENT_ID, YEAR= year(BGN_DATE), BGN_DATE, EVTYPE= EVENT_TYPE,
STATE, C_ZONE= CZ_FIPS, DEATHS= DEATHS_DIRECT + DEATHS_INDIRECT,
INJURIES= INJURIES_DIRECT + INJURIES_INDIRECT,
DAMAGE_PROPERTY, DAMAGE_CROPS, REMARKS= EVENT_NARRATIVE)]
Check if there are any missing (NA) values in the subset data:
anyNA(storm) | anyNA(stormTest)
[1] FALSE
and get acquainted with the updated storm variables:
names(storm) [1] "REFNUM" "YEAR" "BGN_DATE" "EVTYPE" "STATE"
[6] "C_ZONE" "DEATHS" "INJURIES" "PROPDMG" "PROPDMGEXP"
[11] "CROPDMG" "CROPDMGEXP" "REMARKS"
Event types (EVTYPE)
The NOAA storm database specification reports 48 event types, while the storm data set contains almost a thousand of them, viz. 985.
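(A quick check of the latter figure, run at this point in the processing:)
length(unique(storm$EVTYPE))   # 985 unique event types, as stated above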
See how the number of unique event types changed over the years on the interactive plot:
histevtype<- storm %>% group_by(YEAR) %>%
summarise(uniEV = length(unique(EVTYPE))) %>%
ggplot( aes(x = YEAR, y = uniEV)) +
geom_bar(stat="identity", fill="cornflowerblue", colour = "darkgrey")+
xlab("Year")+ ylab("Number of Unique Event Types")+
labs(title =
"OPTIONAL: NOAA Database Unique Event Types Recorded \n(interactive)")+
theme(plot.title = element_text(size = rel(0.8)),
axis.line = element_line(size = 3, colour = "grey80"))
ggplotly(histevtype)
rm(histevtype)
Until the \(1990\)s, there were only 1—3 event types; then their number increased sharply by \(1995\). After that, it began to decline and, since about \(2003-2005\), has corresponded to the 48 official types.
Almost the same picture can be seen on the NOAA website and in the tables below.
Through 1954, only one event type was tracked:
table(storm[YEAR <= 1954, EVTYPE])
TORNADO
1858
From 1955 through 1992, only three event types were reported:
table(storm[YEAR >= 1955 & YEAR <= 1992, EVTYPE])
HAIL TORNADO TSTM WIND
61527 32644 90637
This unevenness in the number of event types can be addressed by calculating the average frequency of event types over the years \(1993-2011\), and then adjusting the damage/ harm of a particular event type based on the full \(1950-2011\) base.
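In formula form (my own notation for the approach just described): for each event type \(t\), a danger ratio is computed from the full base and then weighted by the type’s frequency in the \(1993-2011\) subset,
\[
DR_t=\frac{\text{total harm (or damage) of type } t,\ 1950\text{-}2011}{N_t^{1950-2011}},
\qquad
W_t=\frac{DR_t\cdot N_t^{1993-2011}}{\sum_s DR_s\cdot N_s^{1993-2011}}\times 100\%,
\]
where \(N_t\) denotes the number of recorded events of type \(t\) in the respective period; \(W_t\) is the weight reported later in Tables 7-8 (with deaths/ injuries, and property/ crops damage, pooled in the respective denominators).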
Before that, however, it would be nice to deal with the huge number of unique event types that arose in the \(1990\)s. Take a look at some examples of them:
set.seed(123)
sample(unique(storm[, EVTYPE]), size = 10) [1] "RECORD TEMPERATURES" "ICE AND SNOW" "GLAZE"
[4] "EARLY FROST" "NORMAL PRECIPITATION" "SMOKE"
[7] "COASTALSTORM" "HIGH WINDS DUST STORM" "THUNDERSTORMW 50"
[10] "MUD SLIDE"
So, according to the NWS Storm Data Codebook (“Instruction 10-1605”), only the official event types listed below should be present:
Ntypes <- c("ASTRONOMICAL LOW TIDE",
"AVALANCHE",
"BLIZZARD",
"COASTAL FLOOD",
"COLD/WIND CHILL",
"DEBRIS FLOW",
"DENSE FOG",
"DENSE SMOKE",
"DROUGHT",
"DUST DEVIL",
"DUST STORM",
"EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL",
"FLASH FLOOD",
"FLOOD",
"FROST/FREEZE",
"FUNNEL CLOUD",
"FREEZING FOG",
"HAIL",
"HEAT",
"HEAVY RAIN",
"HEAVY SNOW",
"HIGH SURF",
"HIGH WIND",
"HURRICANE (TYPHOON)",
"ICE STORM",
"LAKE-EFFECT SNOW",
"LAKESHORE FLOOD",
"LIGHTNING",
"MARINE DENSE FOG",
"MARINE HAIL",
"MARINE HEAVY FREEZING SPRAY",
"MARINE HIGH WIND",
"MARINE HURRICANE/TYPHOON",
"MARINE LIGHTNING",
"MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND",
"MARINE TROPICAL DEPRESSION",
"MARINE TROPICAL STORM",
"RIP CURRENT",
"SEICHE",
"SLEET",
"SNEAKER WAVE",
"STORM SURGE/TIDE",
"STRONG WIND",
"THUNDERSTORM WIND",
"TORNADO",
"TROPICAL DEPRESSION",
"TROPICAL STORM",
"TSUNAMI",
"VOLCANIC ASH",
"WATERSPOUT",
"WILDFIRE",
"WINTER STORM",
"WINTER WEATHER")
Ntypes [1] "ASTRONOMICAL LOW TIDE" "AVALANCHE"
[3] "BLIZZARD" "COASTAL FLOOD"
[5] "COLD/WIND CHILL" "DEBRIS FLOW"
[7] "DENSE FOG" "DENSE SMOKE"
[9] "DROUGHT" "DUST DEVIL"
[11] "DUST STORM" "EXCESSIVE HEAT"
[13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
[15] "FLOOD" "FROST/FREEZE"
[17] "FUNNEL CLOUD" "FREEZING FOG"
[19] "HAIL" "HEAT"
[21] "HEAVY RAIN" "HEAVY SNOW"
[23] "HIGH SURF" "HIGH WIND"
[25] "HURRICANE (TYPHOON)" "ICE STORM"
[27] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
[29] "LIGHTNING" "MARINE DENSE FOG"
[31] "MARINE HAIL" "MARINE HEAVY FREEZING SPRAY"
[33] "MARINE HIGH WIND" "MARINE HURRICANE/TYPHOON"
[35] "MARINE LIGHTNING" "MARINE STRONG WIND"
[37] "MARINE THUNDERSTORM WIND" "MARINE TROPICAL DEPRESSION"
[39] "MARINE TROPICAL STORM" "RIP CURRENT"
[41] "SEICHE" "SLEET"
[43] "SNEAKER WAVE" "STORM SURGE/TIDE"
[45] "STRONG WIND" "THUNDERSTORM WIND"
[47] "TORNADO" "TROPICAL DEPRESSION"
[49] "TROPICAL STORM" "TSUNAMI"
[51] "VOLCANIC ASH" "WATERSPOUT"
[53] "WILDFIRE" "WINTER STORM"
[55] "WINTER WEATHER"
To conform the EVTYPE values to the NWS Storm Data Codebook, I’ve created two event-type cleaning functions based on regexes (Appendix: code events, splitter). When writing them, I took into account the NWS Storm Data Instructions and used the NOAA documentation and the REMARKS variable details. Due to the limited accuracy of the original data, though, and some rather unclear rules, a few events may have been attributed to the wrong groups.
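For example, a heavily simplified sketch of the kind of normalisation the events function performs (only the TSTM/ THUNDERSTORM rule, with made-up input values; the full rule set is in the Appendix):
x <- c("TSTM WIND", "THUNDERSTORMW 50", "TSTM WIND 52")
x <- toupper(x)
x <- gsub("[[:punct:]]|[[:digit:]]", " ", x)            # drop punctuation and digits
x <- gsub(" +", " ", x); x <- gsub("(^ )|( $)", "", x)  # squeeze and trim blanks
gsub("^TSTM ?W(IND)?$|^THUNDERSTORMW?$", "THUNDERSTORM WIND", x)
[1] "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND"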
The events function corrects EVTYPE values to match the official types list Ntypes, occasionally producing values that consist of multiple (2—3) event types separated by commas, e.g.:
storm[,EVTYPE:=events(data.table(storm$EVTYPE))]
stormTest[,EVTYPE:=events(data.table(stormTest$EVTYPE))]
set.seed(456)
sample(unique(storm[, EVTYPE]), size = 6)
[1] "MARINE HAIL" "HEAVY RAIN,HIGH WIND"
[3] "BLIZZARD,FROST/FREEZE" "DEBRIS FLOW,HEAVY RAIN"
[5] "BLIZZARD" "FUNNEL CLOUD,HURRICANE (TYPHOON)"
Implementing this function reduces the number of event types from 985 to 102. In turn, the splitter function
storm <- splitter(storm)
storm$DEATHS <- round(storm$DEATHS,0); storm$INJURIES <- round(storm$INJURIES,0)
lenEV<- length(which(unique(storm$EVTYPE) %in% Ntypes)) #48
splits the aforementioned “multiple” event types into single ones, adjusting the damage/ harm caused by the event accordingly. In the end, all 48 of the remaining event types correspond to the official ones in the NWS Storm Data Codebook/ NOAA specification:
levels(storm$EVTYPE) [1] "TORNADO" "THUNDERSTORM WIND"
[3] "HAIL" "HEAVY SNOW"
[5] "WINTER STORM" "HURRICANE (TYPHOON)"
[7] "EXTREME COLD/WIND CHILL" "HEAVY RAIN"
[9] "LIGHTNING" "DENSE FOG"
[11] "RIP CURRENT" "FLASH FLOOD"
[13] "HIGH WIND" "FUNNEL CLOUD"
[15] "HEAT" "STRONG WIND"
[17] "FLOOD" "COLD/WIND CHILL"
[19] "WATERSPOUT" "BLIZZARD"
[21] "FROST/FREEZE" "COASTAL FLOOD"
[23] "STORM SURGE/TIDE" "EXCESSIVE HEAT"
[25] "ICE STORM" "AVALANCHE"
[27] "MARINE STRONG WIND" "MARINE HIGH WIND"
[29] "DUST STORM" "SLEET"
[31] "DUST DEVIL" "HIGH SURF"
[33] "WILDFIRE" "DEBRIS FLOW"
[35] "WINTER WEATHER" "DROUGHT"
[37] "TROPICAL STORM" "LAKESHORE FLOOD"
[39] "LAKE-EFFECT SNOW" "FREEZING FOG"
[41] "VOLCANIC ASH" "SEICHE"
[43] "TROPICAL DEPRESSION" "DENSE SMOKE"
[45] "MARINE THUNDERSTORM WIND" "ASTRONOMICAL LOW TIDE"
[47] "MARINE HAIL" "TSUNAMI"
rm(events,splitter)
Monetary units (PROPDMGEXP, CROPDMGEXP)
To meet the principles of tidy data, there should be one column for each variable. However, the variables PROPDMGEXP and CROPDMGEXP ("the monetary units for property/ crops damage") actually accompany the PROPDMG and CROPDMG variables as multipliers (ten exponents) used to interpret the numeric damage values.
Start with checking what these monetary units are:
blank <- length(which(storm$PROPDMGEXP == ""|
storm$CROPDMGEXP == ""))
unique(c(unique(storm$PROPDMGEXP),unique(storm$CROPDMGEXP))) [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
[20] "k"
Well, there are the alphabetical characters H/h, K/k, M/m, B, as well as the symbols -, ?, +, the numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, and the empty string "".
According to the Storm Data Documentation, only the alphabetical characters H, K, M, B have a clear meaning: they signify the magnitude of the PROPDMG/ CROPDMG values and stand for multiplying them respectively by:
- H — hundred (\(\times 10^2\)),
- K — kilo ~ thousand (\(\times 10^3\)),
- M — million (\(\times 10^6\)),
- B — billion (\(\times 10^9\)).
As for the blanks "", the symbols -, ?, + and the numerals 0...8, there is no explanation. First, note there are 622038 observations with blanks "" in PROPDMGEXP/ CROPDMGEXP, so these are surely just empty values (i.e. multiplier \(=10^0=1\)). Then, look at the distribution over the years of the other strange values of PROPDMGEXP:
strangePROP <- storm[PROPDMGEXP %in%
c("-","?","+","0","1","2","3","4","5","6","7","8"),
.(PROPDMG, PROPDMGEXP, BGN_DATE, YEAR, C_ZONE, EVTYPE)]
table(strangePROP$YEAR, strangePROP$PROPDMGEXP)
- ? + 0 1 2 3 4 5 6 7 8
1993 0 2 0 1 0 0 0 0 0 1 0 0
1994 0 1 1 28 0 0 0 1 2 0 0 0
1995 1 5 5 186 25 13 4 3 26 3 5 1
2011 0 0 0 1 0 0 0 0 0 0 0 0
and CROPDMGEXP:
strangeCROP <- storm[CROPDMGEXP %in%
c("-","?","+","0","1","2","3","4","5","6","7","8"),
.(CROPDMG,CROPDMGEXP, BGN_DATE, YEAR, C_ZONE, EVTYPE)]
table(strangeCROP$YEAR, strangeCROP$CROPDMGEXP)
? 0 2
1993 2 2 0
1994 0 9 0
1995 6 8 1
It looks like the issues with these monetary units in the \(1993-1995\) period (as well as with the EVTYPE values noted previously) were caused by improper handling, and they were fixed during the \(2012\) update (see the version history of the Storm Events Database).
There are several ways to treat these ambiguous values, considering their share is only about 0.04\(\%\): drop the affected observations, treat their units as missing values (NA), or try to recover their meaning. My approach is to figure out how to deal with these obscure data by comparing them to the current values in the revised NOAA database.
Convert PROPDMG/ CROPDMG to character variables to be able to compare them to the corresponding (character) stormTest variables DAMAGE_PROPERTY/ DAMAGE_CROPS. Then investigate the comparison tables (interactive; try to reorder and/ or apply various filters):
storm[, PROPDMG:= as.character(PROPDMG)][
,CROPDMG:= as.character(CROPDMG)]
comparePROP <- merge(strangePROP[,YEAR:=NULL], stormTest[
,.(DAMAGE_PROPERTY,BGN_DATE, C_ZONE, EVTYPE)],
by = c("BGN_DATE","C_ZONE","EVTYPE"), all=FALSE)
compareCROP <- merge(strangeCROP[,YEAR:=NULL], stormTest[,.(DAMAGE_CROPS,
BGN_DATE, C_ZONE, EVTYPE)],
by = c("BGN_DATE","C_ZONE","EVTYPE"), all=FALSE)
datatable(comparePROP,
caption =
"Table1: Monetary units for PROPDMG vs. current DAMAGE_PROPERTY values", rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
datatable(compareCROP,
caption =
"Table2: Monetary units for CROPDMG vs. current DAMAGE_PROPERTY values",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
It can be seen, for example, that the value pairs 88 and 5 become 885 in the revised base, 26 and 6 become 266, 5 and 0 become 50, and so on. So it seems the proper way to deal with the incomprehensible numeric values of PROPDMGEXP/ CROPDMGEXP is to paste them onto the corresponding values of PROPDMG/ CROPDMG.
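In other words, the stray digit in the "unit" column looks like a lost trailing digit of the damage figure itself, and the pasting is trivial (shown here with the value pairs just mentioned):
paste0("88", "5")   # "885"
paste0("26", "6")   # "266"
paste0("5", "0")    # "50"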
As for the non-numeric undefined values (the blanks "" and the symbols -, ?, +), examination of the tables above makes it fairly clear that:
- "" and - mean multiply by 1 (i.e. do not multiply at all),
- ? and + mean multiply by 10.
I’ve written a small R script that maps all these strange values of PROPDMGEXP/ CROPDMGEXP to proper numerics and incorporates them into the PROPDMG/ CROPDMG variables by multiplying (Appendix: code mult). It starts by converting all values to uppercase, then changes them to ten exponents as discussed above, multiplies the corresponding values by them, and finally scales the units to millions of dollars (dividing by \(10^6\)). Run this script, assign the resulting new variables PROP ($, mln)/ CROP ($, mln), and then remove duplicates from storm again.
storm <- storm[
,`:=` ("PROP ($, mln)"=mult(PROPDMG,PROPDMGEXP),
"CROP ($, mln)"=mult(CROPDMG,CROPDMGEXP))][
, c(1:8,14:15,13)]
duo<- select(storm, -c(REFNUM, REMARKS)) %>% duplicated()
storm <- storm[!duo,]
Construct scatterplots (DEATHS\(>0\), INJURIES\(>0\), PROP ($, mln)\(>0\), CROP ($, mln)\(>0\)) to visually detect where outliers could be:
DEATHS <- storm[DEATHS >0, c(3,7)]
INJURIES <- storm[INJURIES >0, c(3,8)]
PROP <- storm[`PROP ($, mln)` >0, c(3,9)]
CROP <- storm[`CROP ($, mln)` >0, c(3,10)]
gDEATHS<-ggplot(DEATHS, aes(x = BGN_DATE,y=round(DEATHS,0)))+
geom_point(shape = 16, size = 2.3, colour = "firebrick3") +
xlab("")+ ylab("deaths, count")+
labs(title ="DEATHS by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
gINJUR<- ggplot(INJURIES, aes(x = BGN_DATE, y = round(INJURIES,0))) +
geom_point(shape = 16, size = 2.3, colour = "springgreen3") +
xlab("")+ ylab("injuries, count")+
labs(title ="INJURIES by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
gPROP<-ggplot(PROP, aes(x = BGN_DATE,y=`PROP ($, mln)`))+
geom_point(shape = 16, size = 2.3, colour = "mediumorchid3") +
xlab("") + ylab("damage ($, mln)")+
labs(title ="DAMAGE to PROPERTY by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
gCROP<- ggplot(CROP, aes(x = BGN_DATE, y = `CROP ($, mln)`)) +
geom_point(shape = 16, size = 2.3, colour = "darkorange3") +
xlab("") + ylab("damage ($, mln)")+
labs(title ="DAMAGE to CROPS by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
grid.arrange(gDEATHS, gINJUR, gPROP, gCROP, nrow=2,
top="FIGURE1. HARM and DAMAGE by DATE")OPTIONAL (external link): Interactive plots for DEATHS and INJURIES
On the damage panels, potential outliers stand out: for PROP ($, mln), a value of about \(110\ 000{-}120\ 000\) (\(\$\), mln) around \(2005\); for CROP ($, mln), two values of about \(5\ 000\) (\(\$\), mln) around \(1995\). To deal with questionable values, explore the tables with the top-30 events (try to reorder and/ or apply various filters). First, harm to humans:
HARMtop <- storm %>% mutate(TOTAL_HARM=DEATHS+INJURIES)%>%
slice_max(TOTAL_HARM, n=30) %>%
select(REFNUM,BGN_DATE,STATE, C_ZONE,EVTYPE,DEATHS,INJURIES,TOTAL_HARM)
datatable(HARMtop,
caption = "Table3: Top-30 harm-to-humans values",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T, scrollY=T)) # NO KATRINA?
So now that the specifics of the events are clear, the potential outliers can be investigated separately:
DEATHSout<- storm %>% filter(DEATHS > 200) %>%
select(-c(YEAR,REMARKS, `PROP ($, mln)`, `CROP ($, mln)`))%>% setorder(BGN_DATE)
INJURIESout<- storm %>% filter(INJURIES > 1000) %>%
select(-c(YEAR,REMARKS, `PROP ($, mln)`, `CROP ($, mln)`))%>% setorder(BGN_DATE)
rbind(DEATHSout,INJURIESout) REFNUM BGN_DATE EVTYPE STATE C_ZONE DEATHS INJURIES
1: 198690 1995-07-12 HEAT IL 0 583 0
2: 67884 1953-06-09 TORNADO MA 27 90 1228
3: 116011 1974-04-03 TORNADO OH 57 36 1150
4: 157885 1979-04-10 TORNADO TX 485 42 1700
5: 223436 1994-02-08 ICE STORM OH 0 1 1568
6: 862563 2011-05-22 TORNADO MO 97 158 1150
I’ve found additional information on all the events in the table, and it seems there are no errors in these entries.
Now, sort out the doubtful values in damage to property/ crops (try to reorder and/ or apply various filters):
DAMtop <- storm %>% mutate(TOTAL_DAMAGE=`PROP ($, mln)`+`CROP ($, mln)`)%>%
slice_max(TOTAL_DAMAGE, n=30) %>%
select(REFNUM, BGN_DATE, STATE, C_ZONE, EVTYPE,`PROP ($, mln)`,`CROP ($, mln)`,
TOTAL_DAMAGE)
datatable(DAMtop,
caption = "Table4: Top-30 damage-to-prop/ crop values",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T)) # KATRINA #2—4
Explore the potential outliers singly:
PROPout<- storm %>% filter(`PROP ($, mln)` > 30000) %>%
select(-c(YEAR,REMARKS, DEATHS, INJURIES)) %>% setorder(`PROP ($, mln)`)
CROPout<- storm %>% filter(`CROP ($, mln)` > 4000) %>%
select(-c(YEAR,REMARKS, DEATHS, INJURIES))%>% setorder(BGN_DATE)
rbind(CROPout,PROPout) REFNUM BGN_DATE EVTYPE STATE C_ZONE PROP ($, mln) CROP ($, mln)
1: 198375 1993-08-31 FLOOD IL 1 5000.0 5000.0
2: 211887 1994-02-09 ICE STORM MS 0 0.5 5000.0
3: 577616 2005-08-29 STORM SURGE/TIDE LA 40 31300.0 0.0
4: 605943 2006-01-01 FLOOD CA 55 115000.0 32.5
It looks like there are no errors in the CROP ($, mln) records either. The highest crop-damage values, at \(5\ 000\) (\(\$\), mln), are reported for the August \(1993\) flood in Illinois and the February \(1994\) ice storm in Mississippi (the first two entries in the table above).
Not all is rosy with the PROP ($, mln) variable, though. The third entry in the table above (REFNUM \(=\) 577616) covers part of Hurricane Katrina, so a value that high is plausible.
As for the fourth entry (REFNUM \(=\) 605943), it relates to a flood in \(2006\) with the amount of property damage recorded in the database as \(\$115\) billion (\(115\ 000\) $, mln). This event belongs to the series of “California flooding 2005-2006”. Quoting from the “Storms and Flooding in California in December 2005 and January 2006 — A Preliminary Assessment”:
A series of storms beginning before Christmas 2005 and ending after New Year’s Day 2006 produced significant runoff over much of northern California. The storms resulted in an estimated $300 million in damages and Federal disaster declarations in 10 counties.
So this observation clearly contains a typo that apparently arose when the monetary unit (PROPDMGEXP) was entered, viz: B for billions was written instead of M for millions.
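A quick sanity check of that reading (my own arithmetic, in the report’s units of $ mln): with M the entry would amount to \(\$115\) mln, which fits the roughly \(\$300\) million total quoted above, whereas with B it blows up to \(115\ 000\) ($, mln):
115 * 10^6 / 10^6   # 115     ($, mln) -- if the unit had been M
115 * 10^9 / 10^6   # 115000  ($, mln) -- as actually recorded with B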
From the above document, it can be learned that the erroneously recorded event took place in Napa (C_ZONE \(=\) 55) and Sonoma (C_ZONE \(=\) 97) counties. See if there are any more misprints related to this series of events:
storm[((year(BGN_DATE)==2005 &month(BGN_DATE)== 12)|
(year(BGN_DATE)== 2006 &month(BGN_DATE)== 1)) &
EVTYPE == "FLOOD" & STATE == "CA" &
C_ZONE %in% c(55, 97), c(1,3:6, 9:10)] REFNUM BGN_DATE EVTYPE STATE C_ZONE PROP ($, mln) CROP ($, mln)
1: 567206 2005-12-31 FLOOD CA 97 104 3.0
2: 567221 2005-12-31 FLOOD CA 55 115 32.5
3: 605943 2006-01-01 FLOOD CA 55 115000 32.5
4: 605951 2006-01-01 FLOOD CA 97 104 3.0
It seems there are also two pairs of duplicates here: identical entries differing only in the date, December 31 vs. January 1 (REFNUM \(=\) {567206, 605951} and REFNUM \(=\) {567221, 605943}). So I’ve removed the later (January 1) duplicate entries, #605951 and #605943, one of which also carried the erroneous value.
storm<- storm[!(REFNUM %in% c(605951, 605943)), ]
OPTIONAL (external link): Interactive scatterplot “DAMAGE to PROPERTY by DATE, corrected”
The data spans more than sixty years, so to compare monetary values it is necessary to adjust them for inflation, i.e. make constant-dollar calculations. This can be done with the Consumer Price Index and the R quantmod package. I’ve adjusted the CPIs to the \(2019\) baseline in a short R script (Appendix: code CPIadj). So now merge the adjusted CPIs into the storm data and recalculate the monetary values, as shown in the code chunk below.
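Explicitly (my notation), a nominal value from year \(y\) is converted to constant \(2019\) dollars as
\[
\text{value}_{2019} = \frac{\text{value}_y}{CPI_y / CPI_{2019}} = \text{value}_y \times \frac{CPI_{2019}}{CPI_y},
\]
which is exactly the adj factor produced by CPIadj and divided out below.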
storm<- storm[,c(1:4,7:10)]
CPI <- CPIadj(storm)
storm <- merge(storm, CPI, by = "YEAR")
storm<-storm[, `PROP ($, mln)`:= round(`PROP ($, mln)`/adj,2)][
, `CROP ($, mln)`:= round(`CROP ($, mln)`/adj,2)][
,c(2:1, 3:8)]
As noted, the number of event types in the original database is uneven over the years: from 1—3 until the early \(1990\)s, up to 48 by the mid-\(2000\)s. There are several approaches to this problem: ignore it, or reduce the data so that only records starting from \(1993/\ 1996/\ 2001\) are analysed. As mentioned above, I’ve suggested addressing the issue by calculating the average frequency of event types over the years \(1993-2011\), and then adjusting the total danger ratios (potential damage/ harm) of event types based on the full \(1950-2011\) base.
So, it’s time to implement this technique. Calculate the relative frequency of event types for the period \(1993-2011\), when all 48 types (and even more) were registered. Also, estimate the weights of the event types in the total aggregate of harm/ damage:
storm93 <- storm[YEAR >= 1993]
stormSH <- storm93 %>% group_by(EVTYPE) %>%
summarise(COUNT = n(),
HARM = sum(sum(DEATHS)+sum(INJURIES)),
DAM = sum(sum(`PROP ($, mln)`) + sum(`CROP ($, mln)`))) %>% as.data.table()
stormSH<- stormSH[, `:=` ("EVTYPE,%"=round((COUNT*100/sum(COUNT)),2),
"HARM,%" = round((HARM*100/sum(HARM)),2),
"DAMAGE,%" = round((DAM*100/sum(DAM)),2))][
,c(1, 5:7)]
setorder(stormSH,-`EVTYPE,%`)
datatable(stormSH,
caption = "Table5, preliminary (1993-2011): Event types frequency/ weights in total harm/ damage",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
The table clearly shows that from \(1993\) through \(2011\) the most frequent events were THUNDERSTORM WIND (\(32.79\%\)) and HAIL (\(25.16\%\)), TORNADO brought the most significant share of harm to humans (\(30.74\%\)), while HURRICANE (TYPHOON) bore the main responsibility for the damage to property/ crops (\(27.15\%\)).
Based on the full \(1950-2011\) data base, derive the danger ratio (DR) for a single event of each event type: the potential amount of harm/ damage one event of a certain type can result in:
SEdanger <- storm %>% group_by(EVTYPE) %>%
summarise(COUNT = n(), DR_DEATHS = sum(DEATHS), DR_INJURIES = sum(INJURIES),
DR_PROP = sum(`PROP ($, mln)`), DR_CROP = sum(`CROP ($, mln)`))%>%
as.data.table()
Dratio<- SEdanger[, lapply(.SD, function(x) round(x/SEdanger$COUNT,2)),
.SDcols=c(3:6)]
Dratio<-cbind(EVTYPE=SEdanger[,EVTYPE], Dratio)
datatable(Dratio,
caption = "Table6: Danger ratio for single events",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
The single event most dangerous in terms of deaths and injuries is TSUNAMI, with DR \(=1.65\) and DR \(=6.45\) respectively, while for property and crops it is HURRICANE (TYPHOON) (DR \(=\$375.93\) mln and DR \(=\$25.53\) mln).
Then, multiply the obtained ratios by the number of events of each type from the \(1993-2011\) data. This creates a kind of virtual database, where the danger ratios correspond to the complete data (\(1950-2011\)), while the frequency of event types is based on the period when all types of events were registered (\(1993-2011\)). The code chunk for this procedure is shown below.
ev93 <- storm93 %>% group_by(EVTYPE) %>% summarise(COUNT = n())
danger <-merge(ev93,Dratio, by=c("EVTYPE")) %>% as.data.table()
danger<- danger[
, `:=` ("DEATHS, %"=round(
DR_DEATHS*COUNT*100/sum((DR_DEATHS+DR_INJURIES)*COUNT),2),
"INJURIES, %"=round(
DR_INJURIES*COUNT*100/
sum((DR_DEATHS+DR_INJURIES)*COUNT),2),
"PROP, %"=round(
DR_PROP*COUNT*100/sum((DR_PROP+DR_CROP)*COUNT),2),
"CROP, %"= round(
DR_CROP*COUNT*100/sum((DR_PROP+DR_CROP)*COUNT),2))][
,c(1,7:10)]
Now look at the final tables showing the weight of each event type in total harm and damage, as well as barplots illustrating the top ten of these tables. First, harm to humans:
topHARM <- danger %>%
transmute(EVTYPE,`TOTAL_HARM, %`= `DEATHS, %`+`INJURIES, %`,
`DEATHS, %`,`INJURIES, %`)%>%
arrange(desc(`TOTAL_HARM, %`))
datatable(topHARM,
caption = "Table7: Event types weights in total harm to humans",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
topHARM <- topHARM %>% head(n = 10)
topHARM <- melt(select(topHARM, EVTYPE, `DEATHS, %`, `INJURIES, %`))
topHARM$EVTYPE <- reorder(topHARM$EVTYPE, topHARM$value)
gtopHARM<- topHARM %>%
ggplot(aes(x = EVTYPE, y = value, fill = variable, order = variable)) +
geom_bar(stat="identity") +
coord_flip() +
scale_fill_manual(name="", values =
c("firebrick3","springgreen3"))+
xlab("Event Type")+ ylab("Total Harm, %")+
labs(title ="FIGURE2. Top 10 DANGERS to HUMANS, % \n(interactive)")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
ggplotly(gtopHARM)
The undisputed, sad leader in danger to humans is TORNADO. It is more than four times ahead of its closest pursuers in total harm (deaths and injuries) — EXCESSIVE HEAT, FLOOD, and THUNDERSTORM WIND.
Take a look at damage to property and crops:
topDAM <- danger %>%
transmute(EVTYPE,`TOTAL_DAM, %`= `PROP, %`+`CROP, %`,
`PROP, %`,`CROP, %`)%>%
arrange(desc(`TOTAL_DAM, %`))
datatable(topDAM,
caption = "Table8: Event types weights in total damage to property/ crops",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
topDAM <- topDAM %>% head(n = 10)
topDAM <- melt(select(topDAM, EVTYPE, `PROP, %`, `CROP, %`))
topDAM$EVTYPE <- reorder(topDAM$EVTYPE, topDAM$value)
gtopDAM<- topDAM %>%
ggplot(aes(x = EVTYPE, y = value, fill = variable, order = variable)) +
geom_bar(stat="identity") +
coord_flip() +
scale_fill_manual(name="", values =
c("darkorange3","mediumorchid3"))+
xlab("Event Type")+ ylab("Total Damage, %")+
labs(title ="FIGURE3. Top 10 DANGERS to ECONOMICS, % \n(interactive)")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
ggplotly(gtopDAM)
There is also an obvious leader in terms of total damage — HURRICANE (TYPHOON). It causes almost double the damage of its nearest pursuers, TORNADO, FLOOD, and STORM SURGE/TIDE, though it is inferior to DROUGHT, FLOOD, and ICE STORM in damage to crops.
Because of the complex data-cleaning steps involved, I’ve answered the questions of the assignment from several angles.
The number of event types in the database is uneven over the years (from 1—3 up to 48); to address this, adjusted danger ratios of event types have been calculated. Based on them, TORNADO turns out to be the event type most harmful to population health, while HURRICANE (TYPHOON) has the most significant economic consequences.
loader
loader <- function() {
library("R.utils"); library("data.table")
myload <- function(url, year = "", zip = "gz") {
dest <- paste0("./data/1119_DS-RR-w4_Storm/storm",
year, ".csv.", zip)
storm <- paste0("./data/1119_DS-RR-w4_Storm/storm",
year, ".csv")
if(!file.exists(dest)) {download.file(url, destfile = dest,
method = "curl")}
if(!file.exists(storm)) {ifelse(zip=="gz",
gunzip(dest, remove=FALSE),
bunzip2(dest, remove=FALSE))
}
}
myload(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
zip = "bz2")
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1993_c20190920.csv.gz",
year = 1993)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1994_c20190920.csv.gz",
year = 1994)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1995_c20190920.csv.gz",
year = 1995)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1996_c20170717.csv.gz",
year = 1996)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2001_c20200518.csv.gz",
year = 2001)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2004_c20200518.csv.gz",
year = 2004)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2005_c20200518.csv.gz",
year = 2005)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2006_c20200518.csv.gz",
year = 2006)
}
events
events <- function(evYY){
evYY<- evYY[,V1:=toupper(V1)][
,V1:=gsub("[[:punct:]+]|[[:digit:]+]", " ", V1)][
,V1:=gsub(" +", " ", V1)][
,V1:=gsub("(^ )|( $)", "", V1)]
evYY<-evYY[
,V1:=gsub("(.*BELOW.*)|(.*COUNTY.*)|(.*DROWNING.*)|
|(.*EXTREMELY.*)|(^.*D HIGH$)|(.*^LACK.*)|(.*METRO.*)|
|(.*MODERATE.*)|(.*MON.*)|(.*NONE.*)|(^NORMAL.*)|
|(.*NORTHERN.*)|(.*^OTHER$.*)|(.*SOUTHEAST.*)|
|(.*SUM.*)|(.*TURBULENCE.*)|
|((NO)? ?[SW]E[A-Z]+ WEA.*)|
|(.*[EIBL][LV][YE] WET(NESS)?.*)|(.*WET W.*)|
|(.*YEAR.*)","", V1)]
evYY<-evYY[
,V1:=gsub("(ABNORMAL(LY)? ?)|( ?ACCUM[A-Z]+ ?)|
|(AGRICULT[A-Z]+ )|(AIR )|( ALBERTO)|( ALOFT)|
|( ?AND$)|( AWNING)|(BLACK )|(BRUSH )|( ?CONDITIONS?)|
|( ?DAMAG[A-Z]+ ?(TO)?)|( ?DE[AE][NP] ?)|( EDOUARD)|
|( EMILY)|( ERIN)|( EFFECTS?$)|(ERATURES?)|
|(([A-Z]+[HL])? ?EROS[A-Z]+ ?)|( ERUPTION)|
|(^EXCESSIVE$)|( EXPOSURE)|(EXTENDED )|
|( F$)|(FALL(ING)? ?)|( FELIX)|( FIRE WX)|(FIRST )|
|( FLOES?)|(FORCE )|
|( ?[CF][RO][IR][TEC][ES][R T] ?(IA)?)|(FROM.*)|
|( GENERATED.*)|( GORDON)|(GRADIENT )|(GRAS?S? )|
|(GROUND )|( GUSTS?)|(GUSTY )|(HAZAR[A-Z]+ ?)|
|( ?HIGH$)|( INJURY)|(ITATION)|(ICE JAMS? )|( JERRY)|
|( AND LIGHT$)|(LIGHT (S[A-Z]+)? ?(AND )?)|
|(LOCAL(LY)? )|(MAJOR )|(MILD ?(AND )?)|
|( ?MINOR ?)|(MOUNTAIN )|( MPH)|(MUD ?(SLIDE )?)|
|(NESS?)|(NEAR )|(NON SEVERE )|( (ON )?ROADS?)|
|( OPAL)|(PACK)|(PATCHY )|( ?PATTERN)|( PLUME)|
|(PROLONG(ED)? )|(REMNANTS? OF )|( ?ROCK )|
|(ROTA[A-Z]+ )|(S?S$)|(S? ?LE CEN)|(^SEASON[A-Z]+ )|
|(SMALL ?)|(( AND)? SMALL$)|(SML )|(SNOW DROUGHT)|
|( SPELL)|( S?TREE[TS]?)|(TH$)|(TORRENTIAL )|
|(UNUSUAL )|(VERY )|( WATCH)|( WAUSEON)|
|(( AND)? WET$)|(^WET )","",V1)][
,V1:=gsub("( AND )|( AND)|(S[ ,](AND )?)",
" ", V1)]
evYY<-evYY[
,V1:=gsub("((SEVERE )?(T[HU][A-Z].+)?[RST][TS][OR][OR]M ?W ?W?I?N?D?S?( G)?)|
|(.*GUST.*)|(.*[^T]BURST.*)|(TSTMW?( W)?I?(ND)?( G)?)|
|(^(SEVERE )?THUNDERSTORM$)|(^THUNDERSTORM (WIND)?)|
|(WIND STORM)","THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("((HEAVY )?(SNOW )?BLOWING SNOW?)|(.+SLEET)|
|(SLEET.+)|(.+SLEET.+)|
|(^(HEAVY )?SNOW[ S][FSIWT][^H].*)|(.*MIX.*)|
|(HAIL ?[ISW][A-Z]+)|(THUNDERSN.*)|
|(^[FIW][RCI][A-Z]+ ([RS][AT][A-Z]+ )?[SWH][NIA][ONI][A-Z]+$)|
|(FLUR.*)|(GLAZE ICE [A-Z]+)","WINTER STORM",V1)]
evYY<- evYY[,V1:= gsub("((STREAM )?URBAN ?([FS][A-Z]+)?( F[A-Z]+)?)|
|([BFHR][RLIA][EAOGP][A-Z]+ [FR][LI][OAS][A-Z]+( WATER)?)|
|((RIVER )?STRE[A-Z]+( FL[A-Z]+)?)|
|(FLASH FLOOD $)","FLASH FLOOD",V1)]
evYY<- evYY[,V1:= gsub("(^CO[OL][A-Z]+( W[A-Z]+)?)( CHIL?L?)?|
|([CL]O[LW]D? TEM[A-Z]+)|
|((COLD )?(LOW )?WIND ?CHI?L?L?( TEM[A-Z]+)?)|
|(COLD$)","COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("((FLOOD )?([RS][IUN][A-Z]+ )?FLO[OY]D(ING)?( FLASH)?)|
|(DAM .*)|(HIGH WATER)", "FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(^([HI][A-Z]+ )?([SW][A-Z]+ )?BLIZZARD( [FHW][A-Z]+)?( [RSW][A-Z]+)?)|
|(DRIFT.*)|((HEAVY )?SNOW HIGH WIND?)|
|(HIGH WIND (HEAVY )?SNOW)|
|(W[A-Z]+ S[A-Z]+ H[A-Z]+ W[A-Z]+)",
"BLIZZARD", V1)]
evYY<- evYY[,V1:= gsub("(COLD.+HIGH WIND)|(HIGH WIND COLD.*)",
"COLD/WIND CHILL BLIZZARD",V1)]
evYY<- evYY[,V1:=gsub("(((EXTREME )?[BER][IXE][A-Z]+ )|(SEVERE ))COLD/",
"EXTREME COLD/", V1)]
evYY<- evYY[,V1:= gsub("(RECORD C[A-Z]+)|(RECORD LOW$)|
|(COLD/WIND CHILL RECORD)",
"EXTREME COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("((RECORD )?(WINTER )?(HEAVY )?(HIGH )?([EW][XW][A-Z]+ )?SNOW( SHOWER)?)|
|(SNOW ?[HRA][ED][A-Z]+( SNOW)?)",
"HEAVY SNOW", V1)]
evYY<- evYY[,V1:= gsub("(RECORD )?(EARLY |EXCESSIVE |HE?A?VY |
|UNSEASONAL )?(RAIN|PREC[A-Z]+|
|SHOWER)( HEAVY|STORM)?", "HEAVY RAIN", V1)]
evYY<- evYY[,V1:= gsub("(^(NON THUNDERSTORM |LAKE )?(WI?ND)( ADVISORY)?)|
|(HIGH WIND G)","STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("FLOOD WIND", "FLOOD STRONG WIND", V1)][
,V1:= gsub("RAIN WIND", "RAIN STRONG WIND", V1)][
,V1:= gsub("SURF WIND", "SURF STRONG WIND", V1)]
evYY<- evYY[, V1:= gsub("LIGHTNING WIND","LIGHTNING STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("(.*(MAY|EARLY|LATE|GLAZE|SPRAY).*)|
|(UN[A-Z]+ CO.*)|(^IC[EY]( JAM)?$)",
"WINTER WEATHER", V1)]
evYY<- evYY[,V1:= gsub("(C(OA)?STA?L ?[FT].+)|(TIDAL.*)|(BEACH.*)",
"COASTAL FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(^HEAT ?[WB][A-Z]+)|
|((UN[A-Z]+ )?(HOT|WARM)( WEATHER| TEMP)?)|
|(HYPER.*)","HEAT", V1)]
evYY<- evYY[,V1:= gsub("((HEAVY |HIGH |ROUGH )?SURF( HIGH SURF| ADV.+)?)|
|((HEAVY |(ST)?RO[A-Z]+ )(WIND )?(SWELL|WAVE|SEA))|(HIGH (SWELL|WAVE))", "HIGH SURF", V1)]
evYY<- evYY[,V1:= gsub("((LAND)?SL[IU].*)|(DEBRI)", "DEBRIS FLOW", V1)][
,V1:= gsub("(((EXC|REC|UNS)[A-Z]+ )?DRY( WEATHER)?)|
|(RECORD LOW.*)","DROUGHT", V1)]
evYY<- evYY[,V1:= gsub("(.*RECORD.*)|(EXTREME HEAT)",
"EXCESSIVE HEAT", V1)]
evYY<- evYY[,V1:= gsub("((FROST |HARD )?FR[OE][A-Z]+$)|(HYPO[A-Z]+)",
"FROST/FREEZE", V1)]
evYY<- evYY[,V1:= gsub(".*ASTR.*","ASTRONOMICAL LOW TIDE", V1)]
evYY<- evYY[,V1:= gsub("(^(COASTAL)? ?S[TU][OR][RG][A-Z]+( SU[A-Z]+)?( T[A-Z]+)?$)|
|([BH][LI][A-Z]+( OUT)? TIDE)",
"STORM SURGE/TIDE", V1)]
evYY<- evYY[,V1:= gsub("LIG[A-Z]+", "LIGHTNING", V1)][
,V1:= gsub("(TORN|WHIR|LAND)[A-Z]+", "TORNADO", V1)][
,V1:= gsub("((HUR|^TYP).*)|(.*WALL CL[A-Z]+)",
"HURRICANE (TYPHOON)", V1)]
evYY<- evYY[,V1:= gsub("((WILD )?FIRE)|(RED.*)", "WILDFIRE", V1)][
,V1:= gsub("(WILD)?WILDFIRE", "WILDFIRE", V1)][
,V1:= gsub("([BS][LA][OH][A-Z]+ D[A-Z]+)|(DUST ?STORM)",
"DUST STORM", V1)]
evYY<- evYY[,V1:= gsub("(FRE[A-Z]+ D[A-Z]+)", "ICE STORM", V1)][
,V1:= gsub("(FRE[A-Z]+ HEA)", "ICE STORM HEA", V1)]
evYY<- evYY[,V1:= gsub("HEAVY (RAIN|SNOW|WET)( HEAVY)?( RAIN| SNOW)",
"SLEET", V1)]
evYY<- evYY[,V1:= gsub("AVAL.*", "AVALANCHE", V1)][
,V1:= gsub("^[FV]OG", "DENSE FOG", V1)][
,V1:= gsub("DU[A-Z]+ DE[A-Z]+", "DUST DEVIL", V1)][
,V1:= gsub("^ICE [FV]OG", "FREEZING FOG", V1)]
evYY<- evYY[,V1:= gsub("LAKE FLOOD","LAKESHORE FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(WAKE.*)|(HEAVY WIND)", "HIGH WIND", V1)][
,V1:= gsub("^SMOKE", "DENSE SMOKE", V1)][
,V1:= gsub("FUNNEL( C[A-Z]+)?", "FUNNEL CLOUD", V1)][
,V1:= gsub("ICE PELLET", "HAIL", V1)][
,V1:= gsub("DHAIL", "D HAIL", V1)]
evYY<- evYY[,V1:= gsub(".*LAKE .*", "LAKE-EFFECT SNOW", V1)][
,V1:= gsub(".*SEA", "MARINE HIGH WIND", V1)][
,V1:= gsub(".*A[CP].*", "MARINE STRONG WIND", V1)][
,V1:= gsub("VOLC.*", "VOLCANIC ASH", V1)]
evYY<- evYY[,V1:= gsub("WA[A-Z]+ ?S[A-Z]+", "WATERSPOUT", V1)]
evYY<- evYY[,V1:= gsub("BLIZZARD AVALANCHE", "AVALANCHE,BLIZZARD", V1)]
evYY<- evYY[,V1:= gsub("(BLIZZARD COLD/WIND CHILL)|
|(^COLD/WIND CHILL BLIZZARD)",
"BLIZZARD,COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("(BLIZZARD EXTREME COLD/WIND CHILL)|
|(EXTREME COLD/WIND CHILL BLIZZARD)",
"BLIZZARD,EXTREME COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("BLIZZARD FLOOD", "BLIZZARD,FLOOD", V1)][
,V1:= gsub("BLIZZARD FROST/FREEZE","BLIZZARD,FROST/FREEZE", V1)]
evYY<- evYY[,V1:= gsub("COLD/WIND CHILL FROST/FREEZE",
"COLD/WIND CHILL,FROST/FREEZE", V1)]
evYY<- evYY[,V1:= gsub("COLD/WIND CHILL FUNNEL CLOUD",
"COLD/WIND CHILL,FUNNEL CLOUD", V1)]
evYY<- evYY[,V1:= gsub("(COLD/WIND CHILL HEAVY SNOW)|
|(HEAVY SNOW COLD/WIND CHILL)",
"COLD/WIND CHILL,HEAVY SNOW", V1)]
evYY<- evYY[,V1:= gsub("COLD/WIND CHILL TORNADO",
"COLD/WIND CHILL,TORNADO", V1)]
evYY<- evYY[,V1:= gsub("DENSE FOG COLD/WIND CHILL",
"COLD/WIND CHILL,DENSE FOG", V1)]
evYY<- evYY[,V1:= gsub("(DROUGHT EXCESSIVE HEAT)|
|(EXCESSIVE HEAT DROUGHT)",
"DROUGHT,EXCESSIVE HEAT", V1)]
evYY<-evYY[,V1:=gsub("(DROUGHT HEAT)|(HEAT DROUGHT)","DROUGHT,HEAT",V1)]
evYY<- evYY[,V1:= gsub("(DUST STORM HIGH WIND)|(HIGH WIND DUST STORM)",
"DUST STORM,HIGH WIND", V1)]
evYY<-evYY[,V1:=gsub("DUST DEVIL WATERSPOUT","DUST DEVIL,WATERSPOUT",V1)]
evYY<- evYY[,V1:= gsub("(EXTREME COLD/WIND CHILL WINTER STORM)|
|(WINTER STORM EXTREME COLD/WIND CHILL)",
"EXTREME COLD/WIND CHILL,WINTER STORM", V1)]
evYY<- evYY[,V1:= gsub("FLASH FLOOD DEBRIS FLOW",
"DEBRIS FLOW,FLASH FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(FLASH FLOOD HEAVY RAIN)|
|(HEAVY RAIN FLASH FLOOD)",
"FLASH FLOOD,HEAVY RAIN", V1)]
evYY<- evYY[,V1:= gsub("FLASH FLOOD STRONG WIND",
"FLASH FLOOD,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("(FLASH FLOOD THUNDERSTORM WIND)|
|(THUNDERSTORM WIND FLASH FLOOD)",
"FLASH FLOOD,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("(FLOOD HEAVY RAIN)|(HEAVY RAIN FLOOD)",
"FLOOD,HEAVY RAIN", V1)]
evYY<- evYY[,V1:= gsub("FLOOD STRONG WIND", "FLOOD,STRONG WIND", V1)][
,V1:= gsub("FUNNEL CLOUD HAIL", "FUNNEL CLOUD,HAIL", V1)][
,V1:= gsub("HAIL FLOOD", "FLOOD,HAIL", V1)][
,V1:= gsub("HEAVY RAIN DEBRIS FLOW",
"DEBRIS FLOW,HEAVY RAIN", V1)]
evYY<-evYY[,V1:= gsub("HEAVY RAIN HIGH SURF","HEAVY RAIN,HIGH SURF",V1)]
evYY<- evYY[,V1:= gsub("(HEAVY RAIN LIGHTNING)|(LIGHTNING HEAVY RAIN)",
"HEAVY RAIN,LIGHTNING", V1)]
evYY<- evYY[,V1:= gsub("(HEAVY RAIN STRONG WIND)|
|(STRONG WIND HEAVY RAIN)",
"HEAVY RAIN,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("HEAVY SNOW EXTREME COLD/WIND CHILL",
"HEAVY SNOW,EXTREME COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("HIGH SURF COASTAL FLOOD",
"COASTAL FLOOD,HIGH SURF",V1)]
evYY<- evYY[,V1:= gsub("HIGH SURF STRONG WIND",
"HIGH SURF,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("HIGH WIND COASTAL FLOOD",
"COASTAL FLOOD,HIGH WIND", V1)]
evYY<- evYY[,V1:= gsub("HIGH WIND FLOOD", "FLOOD,HIGH WIND", V1)][
,V1:= gsub("HIGH WIND HEAVY RAIN", "HEAVY RAIN,HIGH WIND", V1)]
evYY<- evYY[,V1:= gsub("HIGH WIND STORM SURGE/TIDE",
"HIGH WIND,STORM SURGE/TIDE", V1)]
evYY<- evYY[,V1:= gsub("HURRICANE [(]TYPHOON[)] FUNNEL CLOUD",
"FUNNEL CLOUD,HURRICANE (TYPHOON)", V1)]
evYY<- evYY[,V1:= gsub("ICE STORM FLASH FLOOD",
"FLASH FLOOD,ICE STORM", V1)]
evYY<- evYY[,V1:= gsub("ICE STORM FROST/FREEZE",
"FROST/FREEZE,ICE STORM", V1)]
evYY<- evYY[,V1:=gsub("ICE STORM HEAVY RAIN","HEAVY RAIN,ICE STORM",V1)]
evYY<- evYY[,V1:= gsub("LIGHTNING STRONG WIND",
"LIGHTNING,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("(LIGHTNING THUNDERSTORM WIND)|
|(THUNDERSTORM WIND LIGHTNING)",
"LIGHTNING,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("LIGHTNING WILDFIRE", "LIGHTNING,WILDFIRE", V1)][
,V1:= gsub("RIP CURRENT HIGH SURF", "HIGH SURF,RIP CURRENT",V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND FLOOD",
"FLOOD,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND FUNNEL CLOUD",
"FUNNEL CLOUD,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND HAIL",
"HAIL,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND HEAVY RAIN",
"HEAVY RAIN,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("TORNADO DEBRIS FLOW","DEBRIS FLOW,TORNADO",V1)][
,V1:= gsub("TORNADO HAIL","HAIL,TORNADO", V1)]
evYY<- evYY[,V1:= gsub("(TORNADO WATERSPOUT)|(WATERSPOUT TORNADO)",
"TORNADO,WATERSPOUT", V1)]
evYY<- evYY[,V1:= gsub("WATERSPOUT FUNNEL CLOUD",
"FUNNEL CLOUD,WATERSPOUT", V1)]
evYY
}
splitter
splitter <- function(data){
library(data.table); library(splitstackshape)
astr<- data[(EVTYPE != ""),]
astr0<- astr[EVTYPE %in% Ntypes,]
astr <- astr[!(EVTYPE %in% Ntypes),]
astr1 <-(unique(astr$EVTYPE[
grep("((^[A-Z]+( [A-Z]+)?)|
|(^[A-Z]+( [A-Z]+)?/[A-Z]+( [A-Z]+)?)),((([A-Z]+ )?[A-Z]+$)|
|(([A-Z]+ )?[A-Z]+/[A-Z]+( [A-Z]+)?$)|
|([A-Z]+ [(][A-Z]+[)]))",astr$EVTYPE)]))
astr1<- astr[EVTYPE %in% astr1,]
astr1 <- cSplit(indt=astr1, splitCols = "EVTYPE", sep = ",",
direction = "long")
astr1 <- astr1[,paste0(c(7:9,11),"_m"):=
lapply(.SD, function(x) x/2),.SDcols= c(7:9,11)][
,c(1:6,14:16,10,17,12:13)]
setnames(astr1,c(7:9,11),
c("DEATHS","INJURIES","PROPDMG","CROPDMG"))
astr2 <-(unique(astr$EVTYPE[
grep("^[A-Z]+( [A-Z]+)?,[A-Z]+( [A-Z]+)?,[A-Z]+( [A-Z]+)?$",
astr$EVTYPE)]))
astr2<- astr[EVTYPE %in% astr2,]
astr2 <- cSplit(indt=astr2, splitCols = "EVTYPE", sep = ",",
direction = "long")
astr2 <- astr2[,paste0(c(7:9,11),"_m"):=
lapply(.SD, function(x) x/3), .SDcols = c(7:9,11)][
,c(1:6,14:16,10,17,12:13)]
setnames(astr2,c(7:9,11),
c("DEATHS","INJURIES","PROPDMG","CROPDMG"))
astr<-rbind(astr0,astr1,astr2)
setorder(astr,REFNUM)
astr
}
mult
mult <- function(d,e) {
e <- toupper(e)
d <- ifelse(e %in% c("","-"), d,
ifelse(e %in% c("?","+"), as.numeric(d)*10^1,
ifelse(e == "H", as.numeric(d)*10^2,
ifelse (e == "K", as.numeric(d)*10^3,
ifelse(e == "M", as.numeric(d)*10^6,
ifelse (e =="B", as.numeric(d)*10^9,
paste0(d,e)))))))
d<- round(as.numeric(d)/10^6,2)
d
}
CPIadj
CPIadj <- function(data){
getSymbols(Symbols = "CPIAUCSL", src="FRED")
mon <- as.data.frame(CPIAUCSL)
rm(CPIAUCSL)
mon <- cbind(day = rownames(mon), mon)
rownames(mon) <- c()
mon$day <- as.character(mon$day)
mon$YEAR <- year(as.IDate(mon$day, "%Y-%m-%d"))
CPI <- mon %>% group_by(YEAR) %>%
summarise(cpi = mean(CPIAUCSL)) %>%
as.data.frame()
rm(mon)
CPI$adj <- CPI$cpi/CPI$cpi[CPI$YEAR == 2019]
CPI <- CPI %>% filter(YEAR >= data[, min(year(BGN_DATE))] &
YEAR <= data[, max(year(BGN_DATE))])
CPI$cpi <- NULL
CPI
}
sessionInfo (for reproducibility)
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.utils_2.10.1 R.oo_1.24.0 R.methodsS3_1.8.1
[4] splitstackshape_1.4.8 DT_0.16 plotly_4.9.2.1
[7] gridExtra_2.3 ggplot2_3.3.2 quantmod_0.4.18
[10] TTR_0.24.2 xts_0.12.1 zoo_1.8-8
[13] tidyr_1.1.2 dplyr_1.0.2 data.table_1.13.4
loaded via a namespace (and not attached):
[1] pillar_1.4.7 compiler_4.0.3 tools_4.0.3 digest_0.6.27
[5] viridisLite_0.3.0 jsonlite_1.7.2 evaluate_0.14 lifecycle_0.2.0
[9] tibble_3.0.4 gtable_0.3.0 lattice_0.20-41 pkgconfig_2.0.3
[13] rlang_0.4.9 crosstalk_1.1.0.1 curl_4.3 yaml_2.2.1
[17] xfun_0.19 httr_1.4.2 withr_2.3.0 stringr_1.4.0
[21] knitr_1.30 htmlwidgets_1.5.2 generics_0.1.0 vctrs_0.3.5
[25] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0
[29] rmarkdown_2.5 farver_2.0.3 purrr_0.3.4 magrittr_2.0.1
[33] scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.1.1 colorspace_2.0-0
[37] labeling_0.4.2 stringi_1.5.3 lazyeval_0.2.2 munsell_0.5.0
[41] crayon_1.3.4