This is the second project for the Reproducible Research course, which is part of Coursera’s Data Science Specialization and the final course of Coursera’s Data Science: Foundations Using R Specialization.
The report aims to find out which types of severe weather events are most harmful to population health and which have the most significant economic consequences. It can be used by government bodies to prepare for such events, prioritise resources, and prevent destructive outcomes to the extent possible.
To investigate this issue, I’ve explored the NOAA storm database, whose records come from the National Weather Service as well as the media, law enforcement, other government agencies, private companies, and individuals. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, along with estimates of any fatalities/ injuries and property/ crop damage.
The main conclusion is that across the U.S., tornadoes are the events most harmful to population health, while hurricanes (typhoons) have the most significant economic consequences.
The original data was obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA site).
The events in the database for the assignment start in the year \(1950\) and end in November \(2011\). However, as data collection procedures have changed, the unique event types for which data was collected have varied over the years, and more event types were added over time. More information on the database (e.g. descriptions of event types and data set variables, as well as how the variables are constructed) can be found at:
The raw data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. So, it needs to be downloaded, decompressed, and then read into the storm data.table.
Also, to sort out some issues, a “test set” (stormTest) was assembled from the updated data for the years \(1993-1995,\ 2001,\ 2004-2006\). The revised versions for these years are available on the Bulk Data Download page (choose ’Access: HTTP’ there). Download and read it all:
library(data.table); library(dplyr); library(tidyr); library(quantmod)
library(ggplot2); library(gridExtra); library(plotly)
library(DT); library(splitstackshape)
loader()
storm <- fread("./data/1119_DS-RR-w4_Storm/storm.csv")
stormTest<- lapply(list("./data/1119_DS-RR-w4_Storm/storm1993.csv",
"./data/1119_DS-RR-w4_Storm/storm1994.csv",
"./data/1119_DS-RR-w4_Storm/storm1995.csv",
"./data/1119_DS-RR-w4_Storm/storm2001.csv",
"./data/1119_DS-RR-w4_Storm/storm2004.csv",
"./data/1119_DS-RR-w4_Storm/storm2005.csv",
"./data/1119_DS-RR-w4_Storm/storm2006.csv"), fread)
stormTest <- rbindlist(stormTest)
Examine the structure of the storm data and the column names of stormTest:
str(storm)
Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
$ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : chr "" "" "" "" ...
$ BGN_LOCATI: chr "" "" "" "" ...
$ END_DATE : chr "" "" "" "" ...
$ END_TIME : chr "" "" "" "" ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : chr "" "" "" "" ...
$ END_LOCATI: chr "" "" "" "" ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
$ WFO : chr "" "" "" "" ...
$ STATEOFFIC: chr "" "" "" "" ...
$ ZONENAMES : chr "" "" "" "" ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : chr "" "" "" "" ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
names(stormTest) [1] "BEGIN_YEARMONTH" "BEGIN_DAY" "BEGIN_TIME"
[4] "END_YEARMONTH" "END_DAY" "END_TIME"
[7] "EPISODE_ID" "EVENT_ID" "STATE"
[10] "STATE_FIPS" "YEAR" "MONTH_NAME"
[13] "EVENT_TYPE" "CZ_TYPE" "CZ_FIPS"
[16] "CZ_NAME" "WFO" "BEGIN_DATE_TIME"
[19] "CZ_TIMEZONE" "END_DATE_TIME" "INJURIES_DIRECT"
[22] "INJURIES_INDIRECT" "DEATHS_DIRECT" "DEATHS_INDIRECT"
[25] "DAMAGE_PROPERTY" "DAMAGE_CROPS" "SOURCE"
[28] "MAGNITUDE" "MAGNITUDE_TYPE" "FLOOD_CAUSE"
[31] "CATEGORY" "TOR_F_SCALE" "TOR_LENGTH"
[34] "TOR_WIDTH" "TOR_OTHER_WFO" "TOR_OTHER_CZ_STATE"
[37] "TOR_OTHER_CZ_FIPS" "TOR_OTHER_CZ_NAME" "BEGIN_RANGE"
[40] "BEGIN_AZIMUTH" "BEGIN_LOCATION" "END_RANGE"
[43] "END_AZIMUTH" "END_LOCATION" "BEGIN_LAT"
[46] "BEGIN_LON" "END_LAT" "END_LON"
[49] "EPISODE_NARRATIVE" "EVENT_NARRATIVE" "DATA_SOURCE"
rm(loader)
There are \(902\ 297\) records of \(37\) variables in the raw storm data, and \(51\) variables in the stormTest set.
First, deal with duplicates.
duo<- select(storm, -c(REFNUM,REMARKS)) %>% duplicated()
storm <- storm[!duo,]
duostorm<-sum(duo) # 3954
duo<- select(stormTest, -c(EVENT_ID,EVENT_NARRATIVE)) %>% duplicated()
stormTest <- stormTest[!duo,]
duotest<-sum(duo) # 426
Disregarding the id and narrative variables (REFNUM/ REMARKS and EVENT_ID/ EVENT_NARRATIVE), the storm data contains 3954 duplicates and the stormTest set 426. The duplicated observations were removed.
Then, subset the storm data set to the parameters of interest. After exploring its structure, select the following main variables for answering the questions:
- FATALITIES/ INJURIES: deaths and injuries estimated for the event,
- PROPDMG/ CROPDMG: cost of property (buildings, power lines, vehicles)/ crops (crop, grain bins, cows killed) damage,
- PROPDMGEXP/ CROPDMGEXP: the monetary units used to interpret the numeric values of property/ crops damage,
- EVTYPE: weather event type (TORNADO, THUNDERSTORM WIND, EXCESSIVE HEAT, DROUGHT, etc.).
Some helper variables will be needed too, viz: REFNUM (id-variable), BGN_DATE (the date the event/ its public impact begins), STATE, COUNTY (event location, county forecast zone), REMARKS (event narrative).
(Corresponding stormTest variables:
DEATHS_DIRECT+DEATHS_INDIRECT, INJURIES_DIRECT+INJURIES_INDIRECT, DAMAGE_PROPERTY, DAMAGE_CROPS (final values, w/o multipliers), EVENT_TYPE;
EVENT_ID, BEGIN_DATE_TIME, STATE, CZ_FIPS, EVENT_NARRATIVE.)
Extract all of the above variables from both storm and stormTest, and also:
- match up the differently named counterparts (REFNUM/ EVENT_ID and DAMAGE_PROPERTY/ PROPDMG, DAMAGE_CROPS/ CROPDMG),
- convert the BGN_DATE variable to IDate class,
- create a YEAR variable.
storm <- storm[,BGN_DATE:= as.IDate(BGN_DATE, "%m/%d/%Y %H:%M:%S")][
,.(REFNUM, YEAR= year(BGN_DATE), BGN_DATE, EVTYPE, STATE,C_ZONE= COUNTY,
DEATHS=FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG,
CROPDMGEXP, REMARKS)]
stormTest<-stormTest[,BGN_DATE:=as.IDate(BEGIN_DATE_TIME,"%d-%b-%y %H:%M:%S")][
,.(EVENT_ID, YEAR= year(BGN_DATE), BGN_DATE, EVTYPE= EVENT_TYPE,
STATE, C_ZONE= CZ_FIPS, DEATHS= DEATHS_DIRECT + DEATHS_INDIRECT,
INJURIES= INJURIES_DIRECT + INJURIES_INDIRECT,
DAMAGE_PROPERTY, DAMAGE_CROPS, REMARKS= EVENT_NARRATIVE)]
Check if there are any missing (NA) values in the subset data:
anyNA(storm) | anyNA(stormTest)
[1] FALSE
and get acquainted with the updated storm variables:
names(storm) [1] "REFNUM" "YEAR" "BGN_DATE" "EVTYPE" "STATE"
[6] "C_ZONE" "DEATHS" "INJURIES" "PROPDMG" "PROPDMGEXP"
[11] "CROPDMG" "CROPDMGEXP" "REMARKS"
Event types (EVTYPE)
The NOAA storm database specification reports 48 event types, while the storm data set contains almost a thousand of them, viz. 985.
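(A quick check of the latter figure, run at this point in the processing:)
length(unique(storm$EVTYPE))   # 985 unique event types, as stated above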
See how the number of unique event types changed over the years on the interactive plot:
histevtype<- storm %>% group_by(YEAR) %>%
summarise(uniEV = length(unique(EVTYPE))) %>%
ggplot( aes(x = YEAR, y = uniEV)) +
geom_bar(stat="identity", fill="cornflowerblue", colour = "darkgrey")+
xlab("Year")+ ylab("Number of Unique Event Types")+
labs(title =
"OPTIONAL: NOAA Database Unique Event Types Recorded \n(interactive)")+
theme(plot.title = element_text(size = rel(0.8)),
axis.line = element_line(size = 3, colour = "grey80"))
ggplotly(histevtype)
rm(histevtype)
Until the \(1990\)s, there were only 1—3 event types; then their number increased sharply by \(1995\). After that, it began to decline and, since about \(2003-2005\), has corresponded to the 48 official types.
Almost the same picture can be seen on the NOAA website and in the tables below.
Through 1954, only one event type was tracked:
table(storm[YEAR <= 1954, EVTYPE])
TORNADO
1858
From 1955 through 1992, only three event types were reported:
table(storm[YEAR >= 1955 & YEAR <= 1992, EVTYPE])
HAIL TORNADO TSTM WIND
61527 32644 90637
This unevenness in the number of event types can be addressed by calculating the average frequency of event types over the years \(1993-2011\), and then adjusting the damage/ harm of a particular event type based on the full \(1950-2011\) base.
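In formula form (my own notation for the approach just described): for each event type \(t\), a danger ratio is computed from the full base and then weighted by the type’s frequency in the \(1993-2011\) subset,
\[
DR_t=\frac{\text{total harm (or damage) of type } t,\ 1950\text{-}2011}{N_t^{1950-2011}},
\qquad
W_t=\frac{DR_t\cdot N_t^{1993-2011}}{\sum_s DR_s\cdot N_s^{1993-2011}}\times 100\%,
\]
where \(N_t\) denotes the number of recorded events of type \(t\) in the respective period; \(W_t\) is the weight reported later in Tables 7-8 (with deaths/ injuries, and property/ crops damage, pooled in the respective denominators).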
Before that, however, it would be nice to deal with the huge number of unique event types that arose in the \(1990\)s. Take a look at some examples of them:
set.seed(123)
sample(unique(storm[, EVTYPE]), size = 10) [1] "RECORD TEMPERATURES" "ICE AND SNOW" "GLAZE"
[4] "EARLY FROST" "NORMAL PRECIPITATION" "SMOKE"
[7] "COASTALSTORM" "HIGH WINDS DUST STORM" "THUNDERSTORMW 50"
[10] "MUD SLIDE"
So, according to the NWS Storm Data Codebook (“Instruction 10-1605”), only the official event types listed below should be present:
Ntypes <- c("ASTRONOMICAL LOW TIDE",
"AVALANCHE",
"BLIZZARD",
"COASTAL FLOOD",
"COLD/WIND CHILL",
"DEBRIS FLOW",
"DENSE FOG",
"DENSE SMOKE",
"DROUGHT",
"DUST DEVIL",
"DUST STORM",
"EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL",
"FLASH FLOOD",
"FLOOD",
"FROST/FREEZE",
"FUNNEL CLOUD",
"FREEZING FOG",
"HAIL",
"HEAT",
"HEAVY RAIN",
"HEAVY SNOW",
"HIGH SURF",
"HIGH WIND",
"HURRICANE (TYPHOON)",
"ICE STORM",
"LAKE-EFFECT SNOW",
"LAKESHORE FLOOD",
"LIGHTNING",
"MARINE DENSE FOG",
"MARINE HAIL",
"MARINE HEAVY FREEZING SPRAY",
"MARINE HIGH WIND",
"MARINE HURRICANE/TYPHOON",
"MARINE LIGHTNING",
"MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND",
"MARINE TROPICAL DEPRESSION",
"MARINE TROPICAL STORM",
"RIP CURRENT",
"SEICHE",
"SLEET",
"SNEAKER WAVE",
"STORM SURGE/TIDE",
"STRONG WIND",
"THUNDERSTORM WIND",
"TORNADO",
"TROPICAL DEPRESSION",
"TROPICAL STORM",
"TSUNAMI",
"VOLCANIC ASH",
"WATERSPOUT",
"WILDFIRE",
"WINTER STORM",
"WINTER WEATHER")
Ntypes [1] "ASTRONOMICAL LOW TIDE" "AVALANCHE"
[3] "BLIZZARD" "COASTAL FLOOD"
[5] "COLD/WIND CHILL" "DEBRIS FLOW"
[7] "DENSE FOG" "DENSE SMOKE"
[9] "DROUGHT" "DUST DEVIL"
[11] "DUST STORM" "EXCESSIVE HEAT"
[13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
[15] "FLOOD" "FROST/FREEZE"
[17] "FUNNEL CLOUD" "FREEZING FOG"
[19] "HAIL" "HEAT"
[21] "HEAVY RAIN" "HEAVY SNOW"
[23] "HIGH SURF" "HIGH WIND"
[25] "HURRICANE (TYPHOON)" "ICE STORM"
[27] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
[29] "LIGHTNING" "MARINE DENSE FOG"
[31] "MARINE HAIL" "MARINE HEAVY FREEZING SPRAY"
[33] "MARINE HIGH WIND" "MARINE HURRICANE/TYPHOON"
[35] "MARINE LIGHTNING" "MARINE STRONG WIND"
[37] "MARINE THUNDERSTORM WIND" "MARINE TROPICAL DEPRESSION"
[39] "MARINE TROPICAL STORM" "RIP CURRENT"
[41] "SEICHE" "SLEET"
[43] "SNEAKER WAVE" "STORM SURGE/TIDE"
[45] "STRONG WIND" "THUNDERSTORM WIND"
[47] "TORNADO" "TROPICAL DEPRESSION"
[49] "TROPICAL STORM" "TSUNAMI"
[51] "VOLCANIC ASH" "WATERSPOUT"
[53] "WILDFIRE" "WINTER STORM"
[55] "WINTER WEATHER"
To conform the EVTYPE values to the NWS Storm Data Codebook, I’ve created two event-type cleaning functions based on regexes (Appendix: code events, splitter). When writing them, I took into account the NWS Storm Data Instructions and used the NOAA documentation and the REMARKS variable details. Due to the limited accuracy of the original data, though, and some rather unclear rules, a few events may have been attributed to the wrong groups.
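For example, a heavily simplified sketch of the kind of normalisation the events function performs (only the TSTM/ THUNDERSTORM rule, with made-up input values; the full rule set is in the Appendix):
x <- c("TSTM WIND", "THUNDERSTORMW 50", "TSTM WIND 52")
x <- toupper(x)
x <- gsub("[[:punct:]]|[[:digit:]]", " ", x)            # drop punctuation and digits
x <- gsub(" +", " ", x); x <- gsub("(^ )|( $)", "", x)  # squeeze and trim blanks
gsub("^TSTM ?W(IND)?$|^THUNDERSTORMW?$", "THUNDERSTORM WIND", x)
[1] "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND"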
The events function corrects EVTYPE values to match the official types list Ntypes, occasionally producing values that consist of multiple (2—3) event types separated by commas, e.g.:
storm[,EVTYPE:=events(data.table(storm$EVTYPE))]
stormTest[,EVTYPE:=events(data.table(stormTest$EVTYPE))]
set.seed(456)
sample(unique(storm[, EVTYPE]), size = 6)
[1] "MARINE HAIL" "HEAVY RAIN,HIGH WIND"
[3] "BLIZZARD,FROST/FREEZE" "DEBRIS FLOW,HEAVY RAIN"
[5] "BLIZZARD" "FUNNEL CLOUD,HURRICANE (TYPHOON)"
Implementing this function reduces the number of event types from 985 to 102. In turn, the splitter function
storm <- splitter(storm)
storm$DEATHS <- round(storm$DEATHS,0); storm$INJURIES <- round(storm$INJURIES,0)
lenEV<- length(which(unique(storm$EVTYPE) %in% Ntypes)) #48
splits the aforementioned “multiple” event types into single ones, adjusting the damage/ harm caused by the event accordingly. In the end, all 48 of the remaining event types correspond to the official ones in the NWS Storm Data Codebook/ NOAA specification:
levels(storm$EVTYPE) [1] "TORNADO" "THUNDERSTORM WIND"
[3] "HAIL" "HEAVY SNOW"
[5] "WINTER STORM" "HURRICANE (TYPHOON)"
[7] "EXTREME COLD/WIND CHILL" "HEAVY RAIN"
[9] "LIGHTNING" "DENSE FOG"
[11] "RIP CURRENT" "FLASH FLOOD"
[13] "HIGH WIND" "FUNNEL CLOUD"
[15] "HEAT" "STRONG WIND"
[17] "FLOOD" "COLD/WIND CHILL"
[19] "WATERSPOUT" "BLIZZARD"
[21] "FROST/FREEZE" "COASTAL FLOOD"
[23] "STORM SURGE/TIDE" "EXCESSIVE HEAT"
[25] "ICE STORM" "AVALANCHE"
[27] "MARINE STRONG WIND" "MARINE HIGH WIND"
[29] "DUST STORM" "SLEET"
[31] "DUST DEVIL" "HIGH SURF"
[33] "WILDFIRE" "DEBRIS FLOW"
[35] "WINTER WEATHER" "DROUGHT"
[37] "TROPICAL STORM" "LAKESHORE FLOOD"
[39] "LAKE-EFFECT SNOW" "FREEZING FOG"
[41] "VOLCANIC ASH" "SEICHE"
[43] "TROPICAL DEPRESSION" "DENSE SMOKE"
[45] "MARINE THUNDERSTORM WIND" "ASTRONOMICAL LOW TIDE"
[47] "MARINE HAIL" "TSUNAMI"
rm(events,splitter)
Monetary units (PROPDMGEXP, CROPDMGEXP)
To meet the principles of tidy data, there should be one column for each variable. However, the variables PROPDMGEXP and CROPDMGEXP ("the monetary units for property/ crops damage") actually accompany the PROPDMG and CROPDMG variables as multipliers (ten exponents) used to interpret the numeric damage values.
Start with checking what these monetary units are:
blank <- length(which(storm$PROPDMGEXP == ""|
storm$CROPDMGEXP == ""))
unique(c(unique(storm$PROPDMGEXP),unique(storm$CROPDMGEXP))) [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
[20] "k"
Well, there are the alphabetical characters H/h, K/k, M/m, B, as well as the symbols -, ?, +, the numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, and the empty string "".
According to the Storm Data Documentation, only the alphabetical characters H, K, M, B have a clear meaning: they signify the magnitude of the PROPDMG/ CROPDMG values and stand for multiplying them respectively by:
- H — hundred (\(\times 10^2\)),
- K — kilo ~ thousand (\(\times 10^3\)),
- M — million (\(\times 10^6\)),
- B — billion (\(\times 10^9\)).
As for the blanks "", the symbols -, ?, + and the numerals 0...8, there is no explanation. First, note there are 622038 observations with blanks "" in PROPDMGEXP/ CROPDMGEXP, so these are surely just empty values (i.e. multiplier \(=10^0=1\)). Then, look at the distribution over the years of the other strange values of PROPDMGEXP:
strangePROP <- storm[PROPDMGEXP %in%
c("-","?","+","0","1","2","3","4","5","6","7","8"),
.(PROPDMG, PROPDMGEXP, BGN_DATE, YEAR, C_ZONE, EVTYPE)]
table(strangePROP$YEAR, strangePROP$PROPDMGEXP)
- ? + 0 1 2 3 4 5 6 7 8
1993 0 2 0 1 0 0 0 0 0 1 0 0
1994 0 1 1 28 0 0 0 1 2 0 0 0
1995 1 5 5 186 25 13 4 3 26 3 5 1
2011 0 0 0 1 0 0 0 0 0 0 0 0
and CROPDMGEXP:
strangeCROP <- storm[CROPDMGEXP %in%
c("-","?","+","0","1","2","3","4","5","6","7","8"),
.(CROPDMG,CROPDMGEXP, BGN_DATE, YEAR, C_ZONE, EVTYPE)]
table(strangeCROP$YEAR, strangeCROP$CROPDMGEXP)
? 0 2
1993 2 2 0
1994 0 9 0
1995 6 8 1
It looks like the issues with these monetary units in the \(1993-1995\) period (as well as with the EVTYPE values noted previously) were caused by improper handling, and they were fixed during the \(2012\) update (see the version history of the Storm Events Database).
There are several ways to treat these ambiguous values, considering their share is only about 0.04\(\%\): drop the affected observations, treat their units as missing values (NA), or try to recover their meaning. My approach is to figure out how to deal with these obscure data by comparing them to the current values in the revised NOAA database.
Convert PROPDMG/ CROPDMG to character variables to be able to compare them to the corresponding (character) stormTest variables DAMAGE_PROPERTY/ DAMAGE_CROPS. Then investigate the comparison tables (interactive; try to reorder and/ or apply various filters):
storm[, PROPDMG:= as.character(PROPDMG)][
,CROPDMG:= as.character(CROPDMG)]
comparePROP <- merge(strangePROP[,YEAR:=NULL], stormTest[
,.(DAMAGE_PROPERTY,BGN_DATE, C_ZONE, EVTYPE)],
by = c("BGN_DATE","C_ZONE","EVTYPE"), all=FALSE)
compareCROP <- merge(strangeCROP[,YEAR:=NULL], stormTest[,.(DAMAGE_CROPS,
BGN_DATE, C_ZONE, EVTYPE)],
by = c("BGN_DATE","C_ZONE","EVTYPE"), all=FALSE)
datatable(comparePROP,
caption =
"Table1: Monetary units for PROPDMG vs. current DAMAGE_PROPERTY values", rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
datatable(compareCROP,
caption =
"Table2: Monetary units for CROPDMG vs. current DAMAGE_PROPERTY values",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
It can be seen, for example, that the value pairs 88 and 5 become 885 in the revised base, 26 and 6 become 266, 5 and 0 become 50, and so on. So it seems the proper way to deal with the incomprehensible numeric values of PROPDMGEXP/ CROPDMGEXP is to paste them onto the corresponding values of PROPDMG/ CROPDMG.
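In other words, the stray digit in the "unit" column looks like a lost trailing digit of the damage figure itself, and the pasting is trivial (shown here with the value pairs just mentioned):
paste0("88", "5")   # "885"
paste0("26", "6")   # "266"
paste0("5", "0")    # "50"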
As for the non-numeric undefined values (the blanks "" and the symbols -, ?, +), examination of the tables above makes it fairly clear that:
- "" and - mean multiply by 1 (i.e. do not multiply at all),
- ? and + mean multiply by 10.
I’ve written a small R script that maps all these strange values of PROPDMGEXP/ CROPDMGEXP to proper numerics and incorporates them into the PROPDMG/ CROPDMG variables by multiplying (Appendix: code mult). It starts by converting all values to uppercase, then changes them to ten exponents as discussed above, multiplies the corresponding values by them, and finally scales the units to millions of dollars (dividing by \(10^6\)). Run this script, assign the resulting new variables PROP ($, mln)/ CROP ($, mln), and then remove duplicates from storm again.
storm <- storm[
,`:=` ("PROP ($, mln)"=mult(PROPDMG,PROPDMGEXP),
"CROP ($, mln)"=mult(CROPDMG,CROPDMGEXP))][
, c(1:8,14:15,13)]
duo<- select(storm, -c(REFNUM, REMARKS)) %>% duplicated()
storm <- storm[!duo,]
Construct scatterplots (DEATHS\(>0\), INJURIES\(>0\), PROP ($, mln)\(>0\), CROP ($, mln)\(>0\)) to visually detect where outliers could be:
DEATHS <- storm[DEATHS >0, c(3,7)]
INJURIES <- storm[INJURIES >0, c(3,8)]
PROP <- storm[`PROP ($, mln)` >0, c(3,9)]
CROP <- storm[`CROP ($, mln)` >0, c(3,10)]
gDEATHS<-ggplot(DEATHS, aes(x = BGN_DATE,y=round(DEATHS,0)))+
geom_point(shape = 16, size = 2.3, colour = "firebrick3") +
xlab("")+ ylab("deaths, count")+
labs(title ="DEATHS by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
gINJUR<- ggplot(INJURIES, aes(x = BGN_DATE, y = round(INJURIES,0))) +
geom_point(shape = 16, size = 2.3, colour = "springgreen3") +
xlab("")+ ylab("injuries, count")+
labs(title ="INJURIES by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
gPROP<-ggplot(PROP, aes(x = BGN_DATE,y=`PROP ($, mln)`))+
geom_point(shape = 16, size = 2.3, colour = "mediumorchid3") +
xlab("") + ylab("damage ($, mln)")+
labs(title ="DAMAGE to PROPERTY by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
gCROP<- ggplot(CROP, aes(x = BGN_DATE, y = `CROP ($, mln)`)) +
geom_point(shape = 16, size = 2.3, colour = "darkorange3") +
xlab("") + ylab("damage ($, mln)")+
labs(title ="DAMAGE to CROPS by DATE")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
grid.arrange(gDEATHS, gINJUR, gPROP, gCROP, nrow=2,
top="FIGURE1. HARM and DAMAGE by DATE")OPTIONAL (external link): Interactive plots for DEATHS and INJURIES
On the damage panels, potential outliers stand out: for PROP ($, mln), a value of about \(110\ 000{-}120\ 000\) (\(\$\), mln) around \(2005\); for CROP ($, mln), two values of about \(5\ 000\) (\(\$\), mln) around \(1995\). To deal with questionable values, explore the tables with the top-30 events (try to reorder and/ or apply various filters). First, harm to humans:
HARMtop <- storm %>% mutate(TOTAL_HARM=DEATHS+INJURIES)%>%
slice_max(TOTAL_HARM, n=30) %>%
select(REFNUM,BGN_DATE,STATE, C_ZONE,EVTYPE,DEATHS,INJURIES,TOTAL_HARM)
datatable(HARMtop,
caption = "Table3: Top-30 harm-to-humans values",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T, scrollY=T)) # NO KATRINA?
So now that the specifics of the events are clear, the potential outliers can be investigated separately:
DEATHSout<- storm %>% filter(DEATHS > 200) %>%
select(-c(YEAR,REMARKS, `PROP ($, mln)`, `CROP ($, mln)`))%>% setorder(BGN_DATE)
INJURIESout<- storm %>% filter(INJURIES > 1000) %>%
select(-c(YEAR,REMARKS, `PROP ($, mln)`, `CROP ($, mln)`))%>% setorder(BGN_DATE)
rbind(DEATHSout,INJURIESout) REFNUM BGN_DATE EVTYPE STATE C_ZONE DEATHS INJURIES
1: 198690 1995-07-12 HEAT IL 0 583 0
2: 67884 1953-06-09 TORNADO MA 27 90 1228
3: 116011 1974-04-03 TORNADO OH 57 36 1150
4: 157885 1979-04-10 TORNADO TX 485 42 1700
5: 223436 1994-02-08 ICE STORM OH 0 1 1568
6: 862563 2011-05-22 TORNADO MO 97 158 1150
I’ve found additional information on all the events in the table, and it seems there are no errors in these entries.
Now, sort out the doubtful values in damage to property/ crops (try to reorder and/ or apply various filters):
DAMtop <- storm %>% mutate(TOTAL_DAMAGE=`PROP ($, mln)`+`CROP ($, mln)`)%>%
slice_max(TOTAL_DAMAGE, n=30) %>%
select(REFNUM, BGN_DATE, STATE, C_ZONE, EVTYPE,`PROP ($, mln)`,`CROP ($, mln)`,
TOTAL_DAMAGE)
datatable(DAMtop,
caption = "Table4: Top-30 damage-to-prop/ crop values",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T)) # KATRINA #2—4
Explore the potential outliers singly:
PROPout<- storm %>% filter(`PROP ($, mln)` > 30000) %>%
select(-c(YEAR,REMARKS, DEATHS, INJURIES)) %>% setorder(`PROP ($, mln)`)
CROPout<- storm %>% filter(`CROP ($, mln)` > 4000) %>%
select(-c(YEAR,REMARKS, DEATHS, INJURIES))%>% setorder(BGN_DATE)
rbind(CROPout,PROPout) REFNUM BGN_DATE EVTYPE STATE C_ZONE PROP ($, mln) CROP ($, mln)
1: 198375 1993-08-31 FLOOD IL 1 5000.0 5000.0
2: 211887 1994-02-09 ICE STORM MS 0 0.5 5000.0
3: 577616 2005-08-29 STORM SURGE/TIDE LA 40 31300.0 0.0
4: 605943 2006-01-01 FLOOD CA 55 115000.0 32.5
It looks like there are no errors in the CROP ($, mln) records either. The highest crop-damage values, at \(5\ 000\) (\(\$\), mln), are reported for the August \(1993\) flood in Illinois and the February \(1994\) ice storm in Mississippi (the first two entries in the table above).
Not all is rosy with the PROP ($, mln) variable, though. The third entry in the table above (REFNUM \(=\) 577616) covers part of Hurricane Katrina, so a value that high is plausible.
As for the fourth entry (REFNUM \(=\) 605943), it relates to a flood in \(2006\) with the amount of property damage recorded in the database as \(\$115\) billion (\(115\ 000\) $, mln). This event belongs to the series of “California flooding 2005-2006”. Quoting from the “Storms and Flooding in California in December 2005 and January 2006 — A Preliminary Assessment”:
A series of storms beginning before Christmas 2005 and ending after New Year’s Day 2006 produced significant runoff over much of northern California. The storms resulted in an estimated $300 million in damages and Federal disaster declarations in 10 counties.
So this observation clearly contains a typo that apparently arose when the monetary unit (PROPDMGEXP) was entered, viz: B for billions was written instead of M for millions.
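A quick sanity check of that reading (my own arithmetic, in the report’s units of $ mln): with M the entry would amount to \(\$115\) mln, which fits the roughly \(\$300\) million total quoted above, whereas with B it blows up to \(115\ 000\) ($, mln):
115 * 10^6 / 10^6   # 115     ($, mln) -- if the unit had been M
115 * 10^9 / 10^6   # 115000  ($, mln) -- as actually recorded with B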
From the above document, it can be learned that the erroneously recorded event took place in Napa (C_ZONE \(=\) 55) and Sonoma (C_ZONE \(=\) 97) counties. See if there are any more misprints related to this series of events:
storm[((year(BGN_DATE)==2005 &month(BGN_DATE)== 12)|
(year(BGN_DATE)== 2006 &month(BGN_DATE)== 1)) &
EVTYPE == "FLOOD" & STATE == "CA" &
C_ZONE %in% c(55, 97), c(1,3:6, 9:10)] REFNUM BGN_DATE EVTYPE STATE C_ZONE PROP ($, mln) CROP ($, mln)
1: 567206 2005-12-31 FLOOD CA 97 104 3.0
2: 567221 2005-12-31 FLOOD CA 55 115 32.5
3: 605943 2006-01-01 FLOOD CA 55 115000 32.5
4: 605951 2006-01-01 FLOOD CA 97 104 3.0
It seems there are also two pairs of duplicates here: identical entries differing only in the date, December 31 vs. January 1 (REFNUM \(=\) {567206, 605951} and REFNUM \(=\) {567221, 605943}). So I’ve removed the later (January 1) duplicate entries, #605951 and #605943, one of which also carried the erroneous value.
storm<- storm[!(REFNUM %in% c(605951, 605943)), ]
OPTIONAL (external link): Interactive scatterplot “DAMAGE to PROPERTY by DATE, corrected”
The data spans more than sixty years, so to compare monetary values it is necessary to adjust them for inflation, i.e. make constant-dollar calculations. This can be done with the Consumer Price Index and the R quantmod package. I’ve adjusted the CPIs to the \(2019\) baseline in a short R script (Appendix: code CPIadj). So now merge the adjusted CPIs into the storm data and recalculate the monetary values, as shown in the code chunk below.
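Explicitly (my notation), a nominal value from year \(y\) is converted to constant \(2019\) dollars as
\[
\text{value}_{2019} = \frac{\text{value}_y}{CPI_y / CPI_{2019}} = \text{value}_y \times \frac{CPI_{2019}}{CPI_y},
\]
which is exactly the adj factor produced by CPIadj and divided out below.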
storm<- storm[,c(1:4,7:10)]
CPI <- CPIadj(storm)
storm <- merge(storm, CPI, by = "YEAR")
storm<-storm[, `PROP ($, mln)`:= round(`PROP ($, mln)`/adj,2)][
, `CROP ($, mln)`:= round(`CROP ($, mln)`/adj,2)][
,c(2:1, 3:8)]
As noted, the number of event types in the original database is uneven over the years: from 1—3 until the early \(1990\)s, up to 48 by the mid-\(2000\)s. There are several approaches to this problem: ignore it, or reduce the data so that only records starting from \(1993/\ 1996/\ 2001\) are analysed. As mentioned above, I’ve suggested addressing the issue by calculating the average frequency of event types over the years \(1993-2011\), and then adjusting the total danger ratios (potential damage/ harm) of event types based on the full \(1950-2011\) base.
So, it’s time to implement this technique. Calculate the relative frequency of event types for the period \(1993-2011\), when all 48 types (and even more) were registered. Also, estimate the weights of the event types in the total aggregate of harm/ damage:
storm93 <- storm[YEAR >= 1993]
stormSH <- storm93 %>% group_by(EVTYPE) %>%
summarise(COUNT = n(),
HARM = sum(sum(DEATHS)+sum(INJURIES)),
DAM = sum(sum(`PROP ($, mln)`) + sum(`CROP ($, mln)`))) %>% as.data.table()
stormSH<- stormSH[, `:=` ("EVTYPE,%"=round((COUNT*100/sum(COUNT)),2),
"HARM,%" = round((HARM*100/sum(HARM)),2),
"DAMAGE,%" = round((DAM*100/sum(DAM)),2))][
,c(1, 5:7)]
setorder(stormSH,-`EVTYPE,%`)
datatable(stormSH,
caption = "Table5, preliminary (1993-2011): Event types frequency/ weights in total harm/ damage",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
The table clearly shows that from \(1993\) through \(2011\) the most frequent events were THUNDERSTORM WIND (\(32.79\%\)) and HAIL (\(25.16\%\)), TORNADO brought the most significant share of harm to humans (\(30.74\%\)), while HURRICANE (TYPHOON) bore the main responsibility for the damage to property/ crops (\(27.15\%\)).
Based on the full \(1950-2011\) data base, derive the danger ratio (DR) for a single event of each event type: the potential amount of harm/ damage one event of a certain type can result in:
SEdanger <- storm %>% group_by(EVTYPE) %>%
summarise(COUNT = n(), DR_DEATHS = sum(DEATHS), DR_INJURIES = sum(INJURIES),
DR_PROP = sum(`PROP ($, mln)`), DR_CROP = sum(`CROP ($, mln)`))%>%
as.data.table()
Dratio<- SEdanger[, lapply(.SD, function(x) round(x/SEdanger$COUNT,2)),
.SDcols=c(3:6)]
Dratio<-cbind(EVTYPE=SEdanger[,EVTYPE], Dratio)
datatable(Dratio,
caption = "Table6: Danger ratio for single events",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
The single event most dangerous in terms of deaths and injuries is TSUNAMI, with DR \(=1.65\) and DR \(=6.45\) respectively, while for property and crops it is HURRICANE (TYPHOON) (DR \(=\$375.93\) mln and DR \(=\$25.53\) mln).
Then, multiply the obtained ratios by the number of events of each type from the \(1993-2011\) data. This creates a kind of virtual database, where the danger ratios correspond to the complete data (\(1950-2011\)), while the frequency of event types is based on the period when all types of events were registered (\(1993-2011\)). The code chunk for this procedure is shown below.
ev93 <- storm93 %>% group_by(EVTYPE) %>% summarise(COUNT = n())
danger <-merge(ev93,Dratio, by=c("EVTYPE")) %>% as.data.table()
danger<- danger[
, `:=` ("DEATHS, %"=round(
DR_DEATHS*COUNT*100/sum((DR_DEATHS+DR_INJURIES)*COUNT),2),
"INJURIES, %"=round(
DR_INJURIES*COUNT*100/
sum((DR_DEATHS+DR_INJURIES)*COUNT),2),
"PROP, %"=round(
DR_PROP*COUNT*100/sum((DR_PROP+DR_CROP)*COUNT),2),
"CROP, %"= round(
DR_CROP*COUNT*100/sum((DR_PROP+DR_CROP)*COUNT),2))][
,c(1,7:10)]
Now look at the final tables showing the weight of each event type in total harm and damage, as well as barplots illustrating the top ten of these tables. First, harm to humans:
topHARM <- danger %>%
transmute(EVTYPE,`TOTAL_HARM, %`= `DEATHS, %`+`INJURIES, %`,
`DEATHS, %`,`INJURIES, %`)%>%
arrange(desc(`TOTAL_HARM, %`))
datatable(topHARM,
caption = "Table7: Event types weights in total harm to humans",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
topHARM <- topHARM %>% head(n = 10)
topHARM <- melt(select(topHARM, EVTYPE, `DEATHS, %`, `INJURIES, %`))
topHARM$EVTYPE <- reorder(topHARM$EVTYPE, topHARM$value)
gtopHARM<- topHARM %>%
ggplot(aes(x = EVTYPE, y = value, fill = variable, order = variable)) +
geom_bar(stat="identity") +
coord_flip() +
scale_fill_manual(name="", values =
c("firebrick3","springgreen3"))+
xlab("Event Type")+ ylab("Total Harm, %")+
labs(title ="FIGURE2. Top 10 DANGERS to HUMANS, % \n(interactive)")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
ggplotly(gtopHARM)
The undisputed, sad leader in danger to humans is TORNADO. It is more than four times ahead of its closest pursuers in total harm (deaths and injuries) — EXCESSIVE HEAT, FLOOD, and THUNDERSTORM WIND.
Take a look at damage to property and crops:
topDAM <- danger %>%
transmute(EVTYPE,`TOTAL_DAM, %`= `PROP, %`+`CROP, %`,
`PROP, %`,`CROP, %`)%>%
arrange(desc(`TOTAL_DAM, %`))
datatable(topDAM,
caption = "Table8: Event types weights in total damage to property/ crops",
rownames = FALSE, filter="top",
options = list(pageLength = 10, scrollX=T))
topDAM <- topDAM %>% head(n = 10)
topDAM <- melt(select(topDAM, EVTYPE, `PROP, %`, `CROP, %`))
topDAM$EVTYPE <- reorder(topDAM$EVTYPE, topDAM$value)
gtopDAM<- topDAM %>%
ggplot(aes(x = EVTYPE, y = value, fill = variable, order = variable)) +
geom_bar(stat="identity") +
coord_flip() +
scale_fill_manual(name="", values =
c("darkorange3","mediumorchid3"))+
xlab("Event Type")+ ylab("Total Damage, %")+
labs(title ="FIGURE3. Top 10 DANGERS to ECONOMICS, % \n(interactive)")+
theme(plot.title = element_text(size = rel(0.9)),
axis.line = element_line(size = 3, colour = "grey80"))
ggplotly(gtopDAM)
There is also an obvious leader in terms of total damage — HURRICANE (TYPHOON). It causes almost double the damage of its nearest pursuers, TORNADO, FLOOD, and STORM SURGE/TIDE, though it is inferior to DROUGHT, FLOOD, and ICE STORM in damage to crops.
Because of the complex data-cleaning steps involved, I’ve answered the questions of the assignment from several angles.
The number of event types in the database is uneven over the years (from 1—3 up to 48); to address this, adjusted danger ratios of event types have been calculated. Based on them, TORNADO turns out to be the event type most harmful to population health, while HURRICANE (TYPHOON) has the most significant economic consequences.
loader
loader <- function() {
library("R.utils"); library("data.table")
myload <- function(url, year = "", zip = "gz") {
dest <- paste0("./data/1119_DS-RR-w4_Storm/storm",
year, ".csv.", zip)
storm <- paste0("./data/1119_DS-RR-w4_Storm/storm",
year, ".csv")
if(!file.exists(dest)) {download.file(url, destfile = dest,
method = "curl")}
if(!file.exists(storm)) {ifelse(zip=="gz",
gunzip(dest, remove=FALSE),
bunzip2(dest, remove=FALSE))
}
}
myload(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
zip = "bz2")
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1993_c20190920.csv.gz",
year = 1993)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1994_c20190920.csv.gz",
year = 1994)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1995_c20190920.csv.gz",
year = 1995)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1996_c20170717.csv.gz",
year = 1996)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2001_c20200518.csv.gz",
year = 2001)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2004_c20200518.csv.gz",
year = 2004)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2005_c20200518.csv.gz",
year = 2005)
myload(url = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2006_c20200518.csv.gz",
year = 2006)
}
events
events <- function(evYY){
evYY<- evYY[,V1:=toupper(V1)][
,V1:=gsub("[[:punct:]+]|[[:digit:]+]", " ", V1)][
,V1:=gsub(" +", " ", V1)][
,V1:=gsub("(^ )|( $)", "", V1)]
evYY<-evYY[
,V1:=gsub("(.*BELOW.*)|(.*COUNTY.*)|(.*DROWNING.*)|
|(.*EXTREMELY.*)|(^.*D HIGH$)|(.*^LACK.*)|(.*METRO.*)|
|(.*MODERATE.*)|(.*MON.*)|(.*NONE.*)|(^NORMAL.*)|
|(.*NORTHERN.*)|(.*^OTHER$.*)|(.*SOUTHEAST.*)|
|(.*SUM.*)|(.*TURBULENCE.*)|
|((NO)? ?[SW]E[A-Z]+ WEA.*)|
|(.*[EIBL][LV][YE] WET(NESS)?.*)|(.*WET W.*)|
|(.*YEAR.*)","", V1)]
evYY<-evYY[
,V1:=gsub("(ABNORMAL(LY)? ?)|( ?ACCUM[A-Z]+ ?)|
|(AGRICULT[A-Z]+ )|(AIR )|( ALBERTO)|( ALOFT)|
|( ?AND$)|( AWNING)|(BLACK )|(BRUSH )|( ?CONDITIONS?)|
|( ?DAMAG[A-Z]+ ?(TO)?)|( ?DE[AE][NP] ?)|( EDOUARD)|
|( EMILY)|( ERIN)|( EFFECTS?$)|(ERATURES?)|
|(([A-Z]+[HL])? ?EROS[A-Z]+ ?)|( ERUPTION)|
|(^EXCESSIVE$)|( EXPOSURE)|(EXTENDED )|
|( F$)|(FALL(ING)? ?)|( FELIX)|( FIRE WX)|(FIRST )|
|( FLOES?)|(FORCE )|
|( ?[CF][RO][IR][TEC][ES][R T] ?(IA)?)|(FROM.*)|
|( GENERATED.*)|( GORDON)|(GRADIENT )|(GRAS?S? )|
|(GROUND )|( GUSTS?)|(GUSTY )|(HAZAR[A-Z]+ ?)|
|( ?HIGH$)|( INJURY)|(ITATION)|(ICE JAMS? )|( JERRY)|
|( AND LIGHT$)|(LIGHT (S[A-Z]+)? ?(AND )?)|
|(LOCAL(LY)? )|(MAJOR )|(MILD ?(AND )?)|
|( ?MINOR ?)|(MOUNTAIN )|( MPH)|(MUD ?(SLIDE )?)|
|(NESS?)|(NEAR )|(NON SEVERE )|( (ON )?ROADS?)|
|( OPAL)|(PACK)|(PATCHY )|( ?PATTERN)|( PLUME)|
|(PROLONG(ED)? )|(REMNANTS? OF )|( ?ROCK )|
|(ROTA[A-Z]+ )|(S?S$)|(S? ?LE CEN)|(^SEASON[A-Z]+ )|
|(SMALL ?)|(( AND)? SMALL$)|(SML )|(SNOW DROUGHT)|
|( SPELL)|( S?TREE[TS]?)|(TH$)|(TORRENTIAL )|
|(UNUSUAL )|(VERY )|( WATCH)|( WAUSEON)|
|(( AND)? WET$)|(^WET )","",V1)][
,V1:=gsub("( AND )|( AND)|(S[ ,](AND )?)",
" ", V1)]
evYY<-evYY[
,V1:=gsub("((SEVERE )?(T[HU][A-Z].+)?[RST][TS][OR][OR]M ?W ?W?I?N?D?S?( G)?)|
|(.*GUST.*)|(.*[^T]BURST.*)|(TSTMW?( W)?I?(ND)?( G)?)|
|(^(SEVERE )?THUNDERSTORM$)|(^THUNDERSTORM (WIND)?)|
|(WIND STORM)","THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("((HEAVY )?(SNOW )?BLOWING SNOW?)|(.+SLEET)|
|(SLEET.+)|(.+SLEET.+)|
|(^(HEAVY )?SNOW[ S][FSIWT][^H].*)|(.*MIX.*)|
|(HAIL ?[ISW][A-Z]+)|(THUNDERSN.*)|
|(^[FIW][RCI][A-Z]+ ([RS][AT][A-Z]+ )?[SWH][NIA][ONI][A-Z]+$)|
|(FLUR.*)|(GLAZE ICE [A-Z]+)","WINTER STORM",V1)]
evYY<- evYY[,V1:= gsub("((STREAM )?URBAN ?([FS][A-Z]+)?( F[A-Z]+)?)|
|([BFHR][RLIA][EAOGP][A-Z]+ [FR][LI][OAS][A-Z]+( WATER)?)|
|((RIVER )?STRE[A-Z]+( FL[A-Z]+)?)|
|(FLASH FLOOD $)","FLASH FLOOD",V1)]
evYY<- evYY[,V1:= gsub("(^CO[OL][A-Z]+( W[A-Z]+)?)( CHIL?L?)?|
|([CL]O[LW]D? TEM[A-Z]+)|
|((COLD )?(LOW )?WIND ?CHI?L?L?( TEM[A-Z]+)?)|
|(COLD$)","COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("((FLOOD )?([RS][IUN][A-Z]+ )?FLO[OY]D(ING)?( FLASH)?)|
|(DAM .*)|(HIGH WATER)", "FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(^([HI][A-Z]+ )?([SW][A-Z]+ )?BLIZZARD( [FHW][A-Z]+)?( [RSW][A-Z]+)?)|
|(DRIFT.*)|((HEAVY )?SNOW HIGH WIND?)|
|(HIGH WIND (HEAVY )?SNOW)|
|(W[A-Z]+ S[A-Z]+ H[A-Z]+ W[A-Z]+)",
"BLIZZARD", V1)]
evYY<- evYY[,V1:= gsub("(COLD.+HIGH WIND)|(HIGH WIND COLD.*)",
"COLD/WIND CHILL BLIZZARD",V1)]
evYY<- evYY[,V1:=gsub("(((EXTREME )?[BER][IXE][A-Z]+ )|(SEVERE ))COLD/",
"EXTREME COLD/", V1)]
evYY<- evYY[,V1:= gsub("(RECORD C[A-Z]+)|(RECORD LOW$)|
|(COLD/WIND CHILL RECORD)",
"EXTREME COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("((RECORD )?(WINTER )?(HEAVY )?(HIGH )?([EW][XW][A-Z]+ )?SNOW( SHOWER)?)|
|(SNOW ?[HRA][ED][A-Z]+( SNOW)?)",
"HEAVY SNOW", V1)]
evYY<- evYY[,V1:= gsub("(RECORD )?(EARLY |EXCESSIVE |HE?A?VY |
|UNSEASONAL )?(RAIN|PREC[A-Z]+|
|SHOWER)( HEAVY|STORM)?", "HEAVY RAIN", V1)]
evYY<- evYY[,V1:= gsub("(^(NON THUNDERSTORM |LAKE )?(WI?ND)( ADVISORY)?)|
|(HIGH WIND G)","STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("FLOOD WIND", "FLOOD STRONG WIND", V1)][
,V1:= gsub("RAIN WIND", "RAIN STRONG WIND", V1)][
,V1:= gsub("SURF WIND", "SURF STRONG WIND", V1)]
evYY<- evYY[, V1:= gsub("LIGHTNING WIND","LIGHTNING STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("(.*(MAY|EARLY|LATE|GLAZE|SPRAY).*)|
|(UN[A-Z]+ CO.*)|(^IC[EY]( JAM)?$)",
"WINTER WEATHER", V1)]
evYY<- evYY[,V1:= gsub("(C(OA)?STA?L ?[FT].+)|(TIDAL.*)|(BEACH.*)",
"COASTAL FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(^HEAT ?[WB][A-Z]+)|
|((UN[A-Z]+ )?(HOT|WARM)( WEATHER| TEMP)?)|
|(HYPER.*)","HEAT", V1)]
evYY<- evYY[,V1:= gsub("((HEAVY |HIGH |ROUGH )?SURF( HIGH SURF| ADV.+)?)|
|((HEAVY |(ST)?RO[A-Z]+ )(WIND )?(SWELL|WAVE|SEA))|(HIGH (SWELL|WAVE))", "HIGH SURF", V1)]
evYY<- evYY[,V1:= gsub("((LAND)?SL[IU].*)|(DEBRI)", "DEBRIS FLOW", V1)][
,V1:= gsub("(((EXC|REC|UNS)[A-Z]+ )?DRY( WEATHER)?)|
|(RECORD LOW.*)","DROUGHT", V1)]
evYY<- evYY[,V1:= gsub("(.*RECORD.*)|(EXTREME HEAT)",
"EXCESSIVE HEAT", V1)]
evYY<- evYY[,V1:= gsub("((FROST |HARD )?FR[OE][A-Z]+$)|(HYPO[A-Z]+)",
"FROST/FREEZE", V1)]
evYY<- evYY[,V1:= gsub(".*ASTR.*","ASTRONOMICAL LOW TIDE", V1)]
evYY<- evYY[,V1:= gsub("(^(COASTAL)? ?S[TU][OR][RG][A-Z]+( SU[A-Z]+)?( T[A-Z]+)?$)|
|([BH][LI][A-Z]+( OUT)? TIDE)",
"STORM SURGE/TIDE", V1)]
evYY<- evYY[,V1:= gsub("LIG[A-Z]+", "LIGHTNING", V1)][
,V1:= gsub("(TORN|WHIR|LAND)[A-Z]+", "TORNADO", V1)][
,V1:= gsub("((HUR|^TYP).*)|(.*WALL CL[A-Z]+)",
"HURRICANE (TYPHOON)", V1)]
evYY<- evYY[,V1:= gsub("((WILD )?FIRE)|(RED.*)", "WILDFIRE", V1)][
,V1:= gsub("(WILD)?WILDFIRE", "WILDFIRE", V1)][
,V1:= gsub("([BS][LA][OH][A-Z]+ D[A-Z]+)|(DUST ?STORM)",
"DUST STORM", V1)]
evYY<- evYY[,V1:= gsub("(FRE[A-Z]+ D[A-Z]+)", "ICE STORM", V1)][
,V1:= gsub("(FRE[A-Z]+ HEA)", "ICE STORM HEA", V1)]
evYY<- evYY[,V1:= gsub("HEAVY (RAIN|SNOW|WET)( HEAVY)?( RAIN| SNOW)",
"SLEET", V1)]
evYY<- evYY[,V1:= gsub("AVAL.*", "AVALANCHE", V1)][
,V1:= gsub("^[FV]OG", "DENSE FOG", V1)][
,V1:= gsub("DU[A-Z]+ DE[A-Z]+", "DUST DEVIL", V1)][
,V1:= gsub("^ICE [FV]OG", "FREEZING FOG", V1)]
evYY<- evYY[,V1:= gsub("LAKE FLOOD","LAKESHORE FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(WAKE.*)|(HEAVY WIND)", "HIGH WIND", V1)][
,V1:= gsub("^SMOKE", "DENSE SMOKE", V1)][
,V1:= gsub("FUNNEL( C[A-Z]+)?", "FUNNEL CLOUD", V1)][
,V1:= gsub("ICE PELLET", "HAIL", V1)][
,V1:= gsub("DHAIL", "D HAIL", V1)]
evYY<- evYY[,V1:= gsub(".*LAKE .*", "LAKE-EFFECT SNOW", V1)][
,V1:= gsub(".*SEA", "MARINE HIGH WIND", V1)][
,V1:= gsub(".*A[CP].*", "MARINE STRONG WIND", V1)][
,V1:= gsub("VOLC.*", "VOLCANIC ASH", V1)]
evYY<- evYY[,V1:= gsub("WA[A-Z]+ ?S[A-Z]+", "WATERSPOUT", V1)]
evYY<- evYY[,V1:= gsub("BLIZZARD AVALANCHE", "AVALANCHE,BLIZZARD", V1)]
evYY<- evYY[,V1:= gsub("(BLIZZARD COLD/WIND CHILL)|
|(^COLD/WIND CHILL BLIZZARD)",
"BLIZZARD,COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("(BLIZZARD EXTREME COLD/WIND CHILL)|
|(EXTREME COLD/WIND CHILL BLIZZARD)",
"BLIZZARD,EXTREME COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("BLIZZARD FLOOD", "BLIZZARD,FLOOD", V1)][
,V1:= gsub("BLIZZARD FROST/FREEZE","BLIZZARD,FROST/FREEZE", V1)]
evYY<- evYY[,V1:= gsub("COLD/WIND CHILL FROST/FREEZE",
"COLD/WIND CHILL,FROST/FREEZE", V1)]
evYY<- evYY[,V1:= gsub("COLD/WIND CHILL FUNNEL CLOUD",
"COLD/WIND CHILL,FUNNEL CLOUD", V1)]
evYY<- evYY[,V1:= gsub("(COLD/WIND CHILL HEAVY SNOW)|
|(HEAVY SNOW COLD/WIND CHILL)",
"COLD/WIND CHILL,HEAVY SNOW", V1)]
evYY<- evYY[,V1:= gsub("COLD/WIND CHILL TORNADO",
"COLD/WIND CHILL,TORNADO", V1)]
evYY<- evYY[,V1:= gsub("DENSE FOG COLD/WIND CHILL",
"COLD/WIND CHILL,DENSE FOG", V1)]
evYY<- evYY[,V1:= gsub("(DROUGHT EXCESSIVE HEAT)|
|(EXCESSIVE HEAT DROUGHT)",
"DROUGHT,EXCESSIVE HEAT", V1)]
evYY<-evYY[,V1:=gsub("(DROUGHT HEAT)|(HEAT DROUGHT)","DROUGHT,HEAT",V1)]
evYY<- evYY[,V1:= gsub("(DUST STORM HIGH WIND)|(HIGH WIND DUST STORM)",
"DUST STORM,HIGH WIND", V1)]
evYY<-evYY[,V1:=gsub("DUST DEVIL WATERSPOUT","DUST DEVIL,WATERSPOUT",V1)]
evYY<- evYY[,V1:= gsub("(EXTREME COLD/WIND CHILL WINTER STORM)|
|(WINTER STORM EXTREME COLD/WIND CHILL)",
"EXTREME COLD/WIND CHILL,WINTER STORM", V1)]
evYY<- evYY[,V1:= gsub("FLASH FLOOD DEBRIS FLOW",
"DEBRIS FLOW,FLASH FLOOD", V1)]
evYY<- evYY[,V1:= gsub("(FLASH FLOOD HEAVY RAIN)|
|(HEAVY RAIN FLASH FLOOD)",
"FLASH FLOOD,HEAVY RAIN", V1)]
evYY<- evYY[,V1:= gsub("FLASH FLOOD STRONG WIND",
"FLASH FLOOD,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("(FLASH FLOOD THUNDERSTORM WIND)|
|(THUNDERSTORM WIND FLASH FLOOD)",
"FLASH FLOOD,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("(FLOOD HEAVY RAIN)|(HEAVY RAIN FLOOD)",
"FLOOD,HEAVY RAIN", V1)]
evYY<- evYY[,V1:= gsub("FLOOD STRONG WIND", "FLOOD,STRONG WIND", V1)][
,V1:= gsub("FUNNEL CLOUD HAIL", "FUNNEL CLOUD,HAIL", V1)][
,V1:= gsub("HAIL FLOOD", "FLOOD,HAIL", V1)][
,V1:= gsub("HEAVY RAIN DEBRIS FLOW",
"DEBRIS FLOW,HEAVY RAIN", V1)]
evYY<-evYY[,V1:= gsub("HEAVY RAIN HIGH SURF","HEAVY RAIN,HIGH SURF",V1)]
evYY<- evYY[,V1:= gsub("(HEAVY RAIN LIGHTNING)|(LIGHTNING HEAVY RAIN)",
"HEAVY RAIN,LIGHTNING", V1)]
evYY<- evYY[,V1:= gsub("(HEAVY RAIN STRONG WIND)|
|(STRONG WIND HEAVY RAIN)",
"HEAVY RAIN,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("HEAVY SNOW EXTREME COLD/WIND CHILL",
"HEAVY SNOW,EXTREME COLD/WIND CHILL", V1)]
evYY<- evYY[,V1:= gsub("HIGH SURF COASTAL FLOOD",
"COASTAL FLOOD,HIGH SURF",V1)]
evYY<- evYY[,V1:= gsub("HIGH SURF STRONG WIND",
"HIGH SURF,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("HIGH WIND COASTAL FLOOD",
"COASTAL FLOOD,HIGH WIND", V1)]
evYY<- evYY[,V1:= gsub("HIGH WIND FLOOD", "FLOOD,HIGH WIND", V1)][
,V1:= gsub("HIGH WIND HEAVY RAIN", "HEAVY RAIN,HIGH WIND", V1)]
evYY<- evYY[,V1:= gsub("HIGH WIND STORM SURGE/TIDE",
"HIGH WIND,STORM SURGE/TIDE", V1)]
evYY<- evYY[,V1:= gsub("HURRICANE [(]TYPHOON[)] FUNNEL CLOUD",
"FUNNEL CLOUD,HURRICANE (TYPHOON)", V1)]
evYY<- evYY[,V1:= gsub("ICE STORM FLASH FLOOD",
"FLASH FLOOD,ICE STORM", V1)]
evYY<- evYY[,V1:= gsub("ICE STORM FROST/FREEZE",
"FROST/FREEZE,ICE STORM", V1)]
evYY<- evYY[,V1:=gsub("ICE STORM HEAVY RAIN","HEAVY RAIN,ICE STORM",V1)]
evYY<- evYY[,V1:= gsub("LIGHTNING STRONG WIND",
"LIGHTNING,STRONG WIND", V1)]
evYY<- evYY[,V1:= gsub("(LIGHTNING THUNDERSTORM WIND)|
|(THUNDERSTORM WIND LIGHTNING)",
"LIGHTNING,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("LIGHTNING WILDFIRE", "LIGHTNING,WILDFIRE", V1)][
,V1:= gsub("RIP CURRENT HIGH SURF", "HIGH SURF,RIP CURRENT",V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND FLOOD",
"FLOOD,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND FUNNEL CLOUD",
"FUNNEL CLOUD,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND HAIL",
"HAIL,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("THUNDERSTORM WIND HEAVY RAIN",
"HEAVY RAIN,THUNDERSTORM WIND", V1)]
evYY<- evYY[,V1:= gsub("TORNADO DEBRIS FLOW","DEBRIS FLOW,TORNADO",V1)][
,V1:= gsub("TORNADO HAIL","HAIL,TORNADO", V1)]
evYY<- evYY[,V1:= gsub("(TORNADO WATERSPOUT)|(WATERSPOUT TORNADO)",
"TORNADO,WATERSPOUT", V1)]
evYY<- evYY[,V1:= gsub("WATERSPOUT FUNNEL CLOUD",
"FUNNEL CLOUD,WATERSPOUT", V1)]
evYY
}
splitter
splitter <- function(data){
library(data.table); library(splitstackshape)
astr<- data[(EVTYPE != ""),]
astr0<- astr[EVTYPE %in% Ntypes,]
astr <- astr[!(EVTYPE %in% Ntypes),]
astr1 <-(unique(astr$EVTYPE[
grep("((^[A-Z]+( [A-Z]+)?)|
|(^[A-Z]+( [A-Z]+)?/[A-Z]+( [A-Z]+)?)),((([A-Z]+ )?[A-Z]+$)|
|(([A-Z]+ )?[A-Z]+/[A-Z]+( [A-Z]+)?$)|
|([A-Z]+ [(][A-Z]+[)]))",astr$EVTYPE)]))
astr1<- astr[EVTYPE %in% astr1,]
astr1 <- cSplit(indt=astr1, splitCols = "EVTYPE", sep = ",",
direction = "long")
astr1 <- astr1[,paste0(c(7:9,11),"_m"):=
lapply(.SD, function(x) x/2),.SDcols= c(7:9,11)][
,c(1:6,14:16,10,17,12:13)]
setnames(astr1,c(7:9,11),
c("DEATHS","INJURIES","PROPDMG","CROPDMG"))
astr2 <-(unique(astr$EVTYPE[
grep("^[A-Z]+( [A-Z]+)?,[A-Z]+( [A-Z]+)?,[A-Z]+( [A-Z]+)?$",
astr$EVTYPE)]))
astr2<- astr[EVTYPE %in% astr2,]
astr2 <- cSplit(indt=astr2, splitCols = "EVTYPE", sep = ",",
direction = "long")
astr2 <- astr2[,paste0(c(7:9,11),"_m"):=
lapply(.SD, function(x) x/3), .SDcols = c(7:9,11)][
,c(1:6,14:16,10,17,12:13)]
setnames(astr2,c(7:9,11),
c("DEATHS","INJURIES","PROPDMG","CROPDMG"))
astr<-rbind(astr0,astr1,astr2)
setorder(astr,REFNUM)
astr
}
mult
mult <- function(d,e) {
e <- toupper(e)
d <- ifelse(e %in% c("","-"), d,
ifelse(e %in% c("?","+"), as.numeric(d)*10^1,
ifelse(e == "H", as.numeric(d)*10^2,
ifelse (e == "K", as.numeric(d)*10^3,
ifelse(e == "M", as.numeric(d)*10^6,
ifelse (e =="B", as.numeric(d)*10^9,
paste0(d,e)))))))
d<- round(as.numeric(d)/10^6,2)
d
}
CPIadj
CPIadj <- function(data){
getSymbols(Symbols = "CPIAUCSL", src="FRED")
mon <- as.data.frame(CPIAUCSL)
rm(CPIAUCSL)
mon <- cbind(day = rownames(mon), mon)
rownames(mon) <- c()
mon$day <- as.character(mon$day)
mon$YEAR <- year(as.IDate(mon$day, "%Y-%m-%d"))
CPI <- mon %>% group_by(YEAR) %>%
summarise(cpi = mean(CPIAUCSL)) %>%
as.data.frame()
rm(mon)
CPI$adj <- CPI$cpi/CPI$cpi[CPI$YEAR == 2019]
CPI <- CPI %>% filter(YEAR >= data[, min(year(BGN_DATE))] &
YEAR <= data[, max(year(BGN_DATE))])
CPI$cpi <- NULL
CPI
}
sessionInfo (for reproducibility)
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.utils_2.10.1 R.oo_1.24.0 R.methodsS3_1.8.1
[4] splitstackshape_1.4.8 DT_0.16 plotly_4.9.2.1
[7] gridExtra_2.3 ggplot2_3.3.2 quantmod_0.4.18
[10] TTR_0.24.2 xts_0.12.1 zoo_1.8-8
[13] tidyr_1.1.2 dplyr_1.0.2 data.table_1.13.4
loaded via a namespace (and not attached):
[1] pillar_1.4.7 compiler_4.0.3 tools_4.0.3 digest_0.6.27
[5] viridisLite_0.3.0 jsonlite_1.7.2 evaluate_0.14 lifecycle_0.2.0
[9] tibble_3.0.4 gtable_0.3.0 lattice_0.20-41 pkgconfig_2.0.3
[13] rlang_0.4.9 crosstalk_1.1.0.1 curl_4.3 yaml_2.2.1
[17] xfun_0.19 httr_1.4.2 withr_2.3.0 stringr_1.4.0
[21] knitr_1.30 htmlwidgets_1.5.2 generics_0.1.0 vctrs_0.3.5
[25] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0
[29] rmarkdown_2.5 farver_2.0.3 purrr_0.3.4 magrittr_2.0.1
[33] scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.1.1 colorspace_2.0-0
[37] labeling_0.4.2 stringi_1.5.3 lazyeval_0.2.2 munsell_0.5.0
[41] crayon_1.3.4