We try to answer two questions that might be answered by the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
We will need the folowing R packages to do this analysis
library(tidyverse)
library(janitor)
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)
library(inspectdf)
library(fuzzyjoin)
library(ggplot2)
library(forcats)
The data is in this file. The following code loads it into the variable storm_data
#We expect the data is in the project_data subdirectory. If we cannot find it there, we will download it
if (!file.exists(file.path("project_data", "StormData.csv.bz2")))
{
download.file(
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"project_data/StormData.csv.bz2"
)
}
project_data <-
read_csv(
file.path("project_data", "StormData.csv.bz2"),
trim_ws = TRUE,
col_types =
cols(
STATE__ = col_double(),
BGN_DATE = col_character(),
BGN_TIME = col_character(),
TIME_ZONE = col_character(),
COUNTY = col_double(),
COUNTYNAME = col_character(),
STATE = col_character(),
EVTYPE = col_character(),
BGN_RANGE = col_double(),
BGN_AZI = col_character(),
BGN_LOCATI = col_character(),
END_DATE = col_character(),
END_TIME = col_character(),
COUNTY_END = col_character(),
COUNTYENDN = col_character(),
END_RANGE = col_double(),
END_AZI = col_character(),
END_LOCATI = col_character(),
LENGTH = col_double(),
WIDTH = col_double(),
F = col_double(),
MAG = col_double(),
FATALITIES = col_double(),
INJURIES = col_double(),
PROPDMG = col_double(),
PROPDMGEXP = col_character(),
CROPDMG = col_double(),
CROPDMGEXP = col_character(),
WFO = col_character(),
STATEOFFIC = col_character(),
ZONENAMES = col_character(),
LATITUDE = col_double(),
LONGITUDE = col_double(),
LATITUDE_E = col_double(),
LONGITUDE_ = col_double(),
REMARKS = col_character(),
REFNUM = col_double()
)
)
The data has been loaded into the project_data variable.
The variables that seem to have the most economic meaning are: - PROPDMG Property damage - PROPDMGEXP Property damage multiplier - CROPDMG Crop damage - CROPDMGEXP Crop damage multiplier
As the documentation for the dataset states, these are estimates of damages caused by the event and they are captured in dollar value. Lets have a look at the shape of them:
project_data%>%select(contains("DMG"))%>%summary()
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
We are interested in events that actually caused damage, so it seems reasonable to do without anything having zero in both DMG variables. Also, if there should be any NA, we are not interested in them:
project_data %>%
#Recodes na in damage values as zero
mutate(PROPDMG=ifelse(is.na(PROPDMG),0,PROPDMG),
CROPDMG=ifelse(is.na(CROPDMG),0,CROPDMG),
TOTDMG=PROPDMG+CROPDMG
)%>%
#Requires either one to be non zero
filter(TOTDMG>0) ->current_data
current_data%>%select(contains("DMG"))%>%summary()
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:245031 Min. : 0.000 Length:245031
## 1st Qu.: 2.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 8.00 Mode :character Median : 0.000 Mode :character
## Mean : 44.42 Mean : 5.623
## 3rd Qu.: 25.00 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
## TOTDMG
## Min. : 0.01
## 1st Qu.: 2.50
## Median : 10.00
## Mean : 50.04
## 3rd Qu.: 30.00
## Max. :5000.00
Now we have this new dataset, current_data, that has a TOTDMG variable with the sum of property and crop damage. Lets see how those components of total damage are behaving by sampling 10 of them:
current_data%>%sample_n(10)%>%select(contains("DMG"))
The documentation for this dataset states:
Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
So the alphabetical character reffered to in the documentation seems to be the content of the *EXP columns and, as stated, they represent a multiplier or index of magnitude of the damage in dollars.
So lets see if we have NAs in this rather important variables for multiplying estimates of damages:
current_data%>%select(contains("DMG"))%>%inspect_na()
The magnitude multipliers have a large number of NA. Now, these are only significant if they have a corresponding larger than 0 value in their DMG counterpart variable. Lets see the CROP columns with only rows that actually have damage in them, but have NA in their multiplier “EXP” column:
current_data %>% filter(is.na(CROPDMGEXP)&CROPDMG>0)%>%select(contains("DMG"))
There are very few rows with values that have NA as multiplier in CROPDMGEXP and some value in CROPDMG.
However, if we were just to drop the row due to those NA, we might loose some PROPDMG correctly encoded value. The preceeding table shows that the total damage lost in that case would be 5’505,000 USD in total.
Lets check the property side of the damages with the same strategy:
current_data %>% filter(is.na(PROPDMGEXP)&PROPDMG>0)%>%select(contains("DMG"))
76 rows have value in property damage but no multiplier. If we were to get rid of them, would we loose a lot in CROPDMG?
current_data %>% filter(is.na(PROPDMGEXP)&PROPDMG>0 &CROPDMG>0)%>%select(contains("DMG"))
No. We would loose about 74K worth of crop damages.
So since we dont have much to loose, lets get rid of any and all rows that have an NA in either PROPDMGEXP or CROPDMGEXP.
current_data %>%
filter(!is.na(PROPDMGEXP))%>%
filter(!is.na(CROPDMGEXP))->current_data
current_data %>% select(contains("DMG")) %>% inspect_na()
So now the current_data set is the subset of the project_data that: - If it had an NA in any DMG variable, it was replaced with 0 - does not have zero in one or the other DMG value variables - absolutely no NAs were allowed in the magnitude multipliers. Those rows were removed.
Lets see what multipliers actually are there in this variables:
current_data%>%filter(PROPDMG!=0)%>%pull(PROPDMGEXP)%>%table
## .
## 0 3 5 B K m M
## 4 1 2 24 90175 1 3924
So the most important PROPDMGEXP are clearly K,M,B. We might only use this ones. However, the documentation also states that:
If additional precision is available, it may be provided in the narrative part of the entry.
So lets have a look at this narrative data for those seemingly not as signifficant damage types:
current_data%>%filter(PROPDMG!=0)%>%pull(PROPDMGEXP)%>%table%>%names()->mults
mults[! mults %in% c("K","B","M")] -> mults
current_data%>%filter(PROPDMGEXP %in% mults & !is.na(REMARKS) & PROPDMG!=0) %>%
select(PROPDMG,PROPDMGEXP,REMARKS)
As can be seen from the remarks, this records do encode property damage. However, it seems most of them are problematic estimates that we could safely ignore, or they were finally put in the remarks.
We will thus only accept records with K, M and B as multipliers
Lets do the same with the CROPDMGEXP variable:
current_data%>%filter(CROPDMG!=0)%>%pull(CROPDMGEXP)%>%table
## .
## 0 B k K m M
## 11 3 21 16297 1 1477
current_data%>%filter(CROPDMG!=0)%>%pull(CROPDMGEXP)%>%table%>%names()->mults
mults[! mults %in% c("K","B","M")] -> mults
current_data%>%filter(CROPDMGEXP %in% mults & !is.na(REMARKS) & CROPDMG!=0) %>%
select(CROPDMG,CROPDMGEXP,REMARKS)
Again Hurricane Opal and then some other damages. We will only accept records with B,K and M as multipliers.
We will now create plain numeric, full dollar value columns PROPERTY_DAMAGE and CROP_DAMAGE, multiplying accordingly. We will also summarize both economic damage columns into TOTAL_DAMAGE, which will now have the mutliplier taken into account:
current_data %>% mutate(
PROP_MULT=recode(
PROPDMGEXP,K=1000,M=1000000,B=1000000000,.default=0),
CROP_MULT=recode(
CROPDMGEXP,K=1000,M=1000000,B=1000000000,.default=0),
PROPERTY_DAMAGE=PROPDMG*PROP_MULT,
CROP_DAMAGE=CROPDMG*CROP_MULT,
TOTAL_DAMAGE=PROPERTY_DAMAGE+CROP_DAMAGE
)->current_data
current_data %>% select(contains("CROP"),contains("DMG"),contains("DAMAGE")) %>%summary()
## CROPDMG CROPDMGEXP CROP_MULT
## Min. : 0.0 Length:95707 Min. :0.000e+00
## 1st Qu.: 0.0 Class :character 1st Qu.:1.000e+03
## Median : 0.0 Mode :character Median :1.000e+03
## Mean : 11.9 Mean :4.843e+04
## 3rd Qu.: 0.0 3rd Qu.:1.000e+03
## Max. :978.0 Max. :1.000e+09
## CROP_DAMAGE PROPDMG PROPDMGEXP
## Min. :0.000e+00 Min. : 0.00 Length:95707
## 1st Qu.:0.000e+00 1st Qu.: 2.00 Class :character
## Median :0.000e+00 Median : 8.00 Mode :character
## Mean :3.209e+05 Mean : 42.17
## 3rd Qu.:0.000e+00 3rd Qu.: 25.00
## Max. :5.000e+09 Max. :5000.00
## TOTDMG PROPERTY_DAMAGE TOTAL_DAMAGE
## Min. : 0.01 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 3.00 1st Qu.:3.000e+03 1st Qu.:3.000e+03
## Median : 10.00 Median :1.000e+04 Median :1.000e+04
## Mean : 54.08 Mean :2.411e+06 Mean :2.732e+06
## 3rd Qu.: 36.00 3rd Qu.:3.500e+04 3rd Qu.:5.000e+04
## Max. :5000.00 Max. :1.150e+11 Max. :1.150e+11
We can now see a more interesting picture. We only have rows with valid and significant damage data in dollars. By the rows removed, we are seeing an order of magnitude of difference, as project_data has 902297 observations while current_data only has 95707.
However, as we are tasked with both health and economic data, we still need to have all this values in the main table, plus whatever health data we might need. To do that, an unique per row id table is ideal, as this would allow us to join the economic data we have processed back into the main table
project_data%>%select(REFNUM) %>%distinct() %>% nrow()
## [1] 902297
project_data%>%nrow()
## [1] 902297
As we can see, the full table has an id column called REFNUM. It is the same as the one in current_data. So now we can join by that column:
current_data %>% select(contains("CROP"),contains("DMG"),contains("DAMAGE"),contains("REFNUM")) ->econ_data
project_data<-left_join(project_data,econ_data,by=c("REFNUM"="REFNUM"))
rm(current_data)
rm(econ_data)
gc(full = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1487995 79.5 2759305 147.4 1946081 104.0
## Vcells 69192050 527.9 212395949 1620.5 212395105 1620.5
This leaves project_data with all its original rows and the value new columns. Where the refnum matched, it will have the corresponding DAMAGE values. Where it did not, the value for those columns will be NA.
Both questions we are attempting to answer require us to categorize events by the EVTYPE variable content. Lets have a look at it:
project_data %>% distinct(EVTYPE) %>% arrange(EVTYPE)
It seems that the event types repeat themselves in meaning. For example:
Sometimes by letter case and plurals: - Frost/Freeze
- FROST/FREEZE - HAIL/WIND
- HAIL/WINDS
And that they encode subcategories at some points, separated by a “/” character: - DUST STORM
- DUST STORM/HIGH WINDS
Some are invalid characters or present troublesome features.
The variation seems very significant, as there are close to 1000 distinct event types that seem to have a lot of capture errors.
Lets see if some of them are more or less important for our purposes:
library(dplyr)
project_data%>%
pull(EVTYPE) %>%
table %>%
sort(decreasing = TRUE) %>% head(30)
## .
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219944 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54278 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15755
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3797 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
So some events are more frequent than others, but as we are interested in economic value and health data, we cannot just reduce this by eliminating the less frequent ones.
Fortunately, there is an authoritative table of event types in the documentation for the NOAA dataset downloaded at: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. In section 2, it is stated:
Permitted Storm Data Events.The only events permitted in Storm Data are listed in Table 1 of Section 2.1.1. The chosen event name should be the one that most accurately describes the meteorological event leading to fatalities, injuries, damage, etc. However, significant events, such as tornadoes, having no impact or causing no damage, should also be included in Storm Data. See Section 7 for detailed examples.
Said table in the PDF encodes very similarly named, but not exactly, events as we have seen in the data. This is a screenshot of it.
Note: i did not regard this screenshot as a “figure”. It is evidence of due diligence and documentation
I encoded it in a csv file called event_table.csv, which I will load now
authorized_types<-read_csv("./project_data/event_types.csv",col_types=cols(
event_name = col_character()
),trim_ws=TRUE)
head(authorized_types)
Idealy, we would want to keep any and all rows from our economically significant dataset in health_data. We will try several strategies for this. The first thing is to know the full economic damage. This is straightforward:
sum(project_data$TOTAL_DAMAGE,na.rm=TRUE)%>%prettyNum(big.mark = ",")
## [1] "261,470,112,580"
Whatever we do with the EVTYPEs, we must account for that much of economic value.
A similar meassure of health data impact by category can be achieved by adding injuries and fatalities:
project_data%>%mutate(TOTAL_HEALTH_PROBLEMS=INJURIES+FATALITIES) ->project_data
sum(project_data$TOTAL_HEALTH_PROBLEMS)%>%prettyNum(big.mark = ",")
## [1] "155,673"
So this two figures are the total economic value and the total health impact respectively. Whatever we do to the categorize EVTYPES better, the total categorized values should approach this and still preserve some insight of the categories used.
From the above, we can say that the authorized event names table still has some unfortunate, for us, characteristics. For example, HURRICANE (TYPHOON), which is an expression of synonymity, prevents matching for anything containing the word hurricane if we are testing by equality of strings. Also, EXTREME COLD/WIND CHILL presents the same kind of problem.
As a solution I will start on the authorized_types_m dataframe. Its “upper” column contains the authorized name in uppercase. Using that I will create a “matcher” column which will contain a regular expression to match for either word in synonyms. Thus HURRICANE (TYPHOON) will become (HURRICANE)|(TYPHOON), which will match anything in the EVTYPE column of the dataset, with the words HURRICANE or TYPHOON in any order or context.
We will do the same with authorized names that use the “/” character.
Recall that we have two dataframes for this work. The orig_events_type_m has all the unique names of the EVTYPE columns and an upper column for them.
orig_event_types_m%>%str
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 977 obs. of 2 variables:
## $ EVTYPE: chr "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...
## $ upper : chr "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...
From the authorized names table, we constructed authorized_types_m. It has the authorized name in the event_name column and an upper column, which has the same thing but in uppercase.
authorized_types_m %>% str()
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 48 obs. of 2 variables:
## $ event_name: chr "Astronomical Low Tide" "Avalanche" "Blizzard" "Coastal Flood" ...
## $ upper : chr "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" "COASTAL FLOOD" ...
It is this last one to which we add the matcher column with a regular expression that tries to capture all or most of the related names in the upper column of the orig_event_types_m:
authorized_types_m %>% mutate(matcher=
str_replace_all(
trimws(upper),
c(
#replaces parenthezied names with an "or" regex
"([^\\( ]*) \\(([^\\)]*)\\)"="(\\1|\\2)",
#replaces "/" names with an "or" regex
"(.*)/(.*)"="(\\1|\\2)")
)) ->authorized_types_m
authorized_types_m
Lets see if we can match better with this matcher column and a regular expression matcher for a join operation.
This can be accomplished by the fuzzyjoin::regex_*_join family of functions.
First we will obtain all the events that actually did match with an equal join, so we may remove them from the regex join:
inner_join(orig_event_types_m,authorized_types_m,c(upper="upper"))%>%
mutate(match_type="equal")-> straight_equal_joined
#this variable has all events that have identical names in both tables
#We will now create another table that will join by regex, but only
#with the events that were not matched by the inner_join
already_matched <- straight_equal_joined$upper
Now we filter those already matched and do our regex join:
orig_event_types_m%>%
filter(!upper %in% already_matched)%>%
regex_inner_join(.,authorized_types_m,
by=c(
upper="matcher")
) %>%
mutate(match_type="regex")-> regex_joined
Now we have two tables: straight_equal_joined and regex_joined. Each has a different subset of the full set of unique EVTYPE names, matched to the authorized events. We want that set to be one table but, to do that, they have to have the same columns so we can row_bind them.
#We rename the columns to have the column names
straight_equal_joined %>% select(
#The original EVTYPE column name
EVTYPE_NAME=EVTYPE,
#The authorized name in UPPERCASE
UPPER_AUTHORIZED=upper,
#The authorized name as in the docs
AUTHORIZED=event_name,
MATCH_TYPE=match_type
) -> straight_equal_joined
regex_joined %>% select(
EVTYPE_NAME=EVTYPE,
UPPER_AUTHORIZED=upper.y,
AUTHORIZED=event_name,
MATCH_TYPE=match_type
)->regex_joined
regex_joined
So far so good. However, from the output, you can see the regex_joined table has the problem of duplicating rows whenever am EVTYPE is of a combined type. That is, when one EVTYPE name matches two authorized names.
This is a treatable issue. We will treat that type of event as a new category for grouping value. We will create a new event category which we will call “Compound”, whenever the event name in EVTYPE comes under the classification of more than one authorized type name. We will group damage for those events separately in the analysis.
Lets extract the duplicated names and see how we might reclassify them:
regex_joined[duplicated(regex_joined$EVTYPE_NAME),]$EVTYPE_NAME ->duplicated_names
regex_joined %>% mutate(is_duplicated=(EVTYPE_NAME %in% duplicated_names))->regex_joined
regex_joined %>% filter(is_duplicated) -> only_duplicated_rows
only_duplicated_rows
Now, by looking at this, we still see names that are not really an expert classifying this as two types for the same event, but are capture issues:
| EVTYPE_NAME | UPPER_AUTHORIZED |
|---|---|
| FLASH FLOODING | FLASH FLOOD |
| FLASH FLOODING | FLOOD |
This should be classified as a plain FLASH FLOOD but because of the ING ending, it would be missclassified by us as a dual event compound. We will regard this as a special case.
Anotherone of the same problematic nature:
| EVTYPE_NAME | UPPER_AUTHORIZED |
|---|---|
| EXTREME COLD | COLD/WIND CHILL |
| EXTREME COLD | EXTREME COLD/WIND CHILL |
Should only match EXTREME COLD. Also, a problem with our regular expression approach.
These are the only two special cases in the name_equivalence table, with type regex, that yield a repeat authorized type classification, but would be missclassified as a “compound” event type as we propose.
We will “fix” this by creating a “special_cases” table, that will have the same columns but match_type=“by hand”, where we will extract only this strange and unique cases.
#Deal with the FLASH FLOODING special case
only_duplicated_rows%>%
#Select both rows
filter(EVTYPE_NAME == "FLASH FLOODING")%>%
#Change match type, we are doing it by hand
mutate(MATCH_TYPE="by hand")%>%
#Only keep the row that makes sense
filter(UPPER_AUTHORIZED=="FLASH FLOOD")->special_cases_1
#Remove from the original
only_duplicated_rows %>% filter(EVTYPE_NAME != "FLASH FLOODING") ->only_duplicated_rows
#Deal with the EXTREME COLD special case
only_duplicated_rows %>%
filter(EVTYPE_NAME=="EXTREME COLD") %>%
mutate(MATCH_TYPE="by hand")%>%
filter(UPPER_AUTHORIZED=="EXTREME COLD/WIND CHILL")->special_cases_2
only_duplicated_rows %>% filter(EVTYPE_NAME != "EXTREME COLD")->only_duplicated_rows
bind_rows(special_cases_1,special_cases_2) -> special_cases
special_cases
special_cases now has the only cases we know do not meet our assumptions about event types. We have removed those cases from only_duplicated_rows, for which we still need a solution.
As with the straight_joined, regex_joined and special cases, we will create a compound_types dataset with the subset of only_duplicated_rows that dont have special_cases. We will keep the same columns as the regex_joined, straight_joined tables, because in the end we need exactly one table with all the correspondence between authorized names and unique EVTYPEs. The values of the columns in this new table are as follows:
The following code accomplishes the described table:
only_duplicated_rows %>%
select(EVTYPE_NAME,AUTHORIZED) %>%
arrange(EVTYPE_NAME)%>%
#Create a "key" column to uniquely identify EVTYPE_NAMExAUTHORIZED combinations
mutate(key=paste0(EVTYPE_NAME,"_",AUTHORIZED))%>%
#Spread will create columns for each combination in key
#The value in each column will be the corresponding AUTHORIZED name for each
#EVTYPE_NAME. It will place NA where that EVTYPE_NAME did not match the AUTHORIZED
#and the name in AUTHORIZED where it did.
spread(key,AUTHORIZED) %>%
#Each row in this stage has a set of NAs or AUTHORIZED names that the EVTYPE matches to
#Unite will paste all columns, separated by comma, including NA as "NA"
unite("AUTHORIZED",-EVTYPE_NAME,sep = ",",remove=FALSE) %>%
#All the NA will be a string "NA". We need to remove them
mutate(AUTHORIZED=str_remove_all(AUTHORIZED,"NA,|NA|,NA"))%>%
#As we stated, the AUTHORIZED column will encode the value as Compound:
#to help us identifythis sort of compound categorization further downstream
mutate(AUTHORIZED=paste0("Compound: ",AUTHORIZED))%>%
#We want the AUTHORIZED for each repeated EVTYPE_NAME.
#any other remaining column is ignored
select(EVTYPE_NAME,AUTHORIZED) -> compound_evtypes
#We can now directly match all EVTYPE names that were matched by this strategy
#and extract them from the only_duplicated_rows table.
inner_join(compound_evtypes,only_duplicated_rows,
by=c("EVTYPE_NAME"="EVTYPE_NAME")) %>%
mutate(UPPER_AUTHORIZED=toupper(AUTHORIZED.x),
MATCH_TYPE="by hand: compound")%>%
#since the join renames some columns, we rename:
select(EVTYPE_NAME,
AUTHORIZED=AUTHORIZED.x,
UPPER_AUTHORIZED,
MATCH_TYPE) %>%
# #The above will still have repeated rows, although all repeated
# # will be identical. Thus, we use distinct()
distinct()->compound_types
Now we can remove from regex_joined any records that were NOT duplicated:
#lets get only the real regex matches
regex_joined %>% filter(!is_duplicated) %>%select(-is_duplicated)-> regex_joined_only_nondup
Now we have a number of subsets of the unique EVTYPE names that have matched to the authorized names by several strategies. For straight equality, this is the subset:
straight_equal_joined
We also have the special_cases, which has a few outlier cases:
special_cases
We also have the regex matches that did not cause duplicates:
regex_joined_only_nondup
And finally, the compound types, which were regex matches that represent two or more authorized names for each EVTYPE:
compound_types
Together, this tables, should add up to the total of distinct names in the EVTYPES variable of the original database, as extracted in health_data:
bind_rows(straight_equal_joined,
regex_joined_only_nondup,
special_cases,
compound_types) -> evtype_authorized_translation
nrow(evtype_authorized_translation)
## [1] 515
nrow(distinct(project_data,EVTYPE))
## [1] 977
As you can see, we are still missing quite a lot of names. Almost half of them. Lets look at how much dollar value or even health event information do this yet unclassified names can group. We need to know if we can ignore unclassified event types.
Lets calculate totals and percentage of damage value components:
project_data%>% ungroup()%>%summarise(
total_damage=sum(TOTAL_DAMAGE,na.rm=T),
crop_total=sum(CROP_DAMAGE,na.rm=T),
property_total=sum(PROPERTY_DAMAGE,na.rm=T),
total_health_prob=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
total_fatal=sum(FATALITIES,na.rm=T),
total_injuries=sum(INJURIES,na.rm=T)
)->total_damages_sum
#Print out the totals:
total_damages_sum%>%prettyNum(",")
## total_damage crop_total property_total total_health_prob
## "261,470,112,580" "30,716,558,210" "230,753,554,370" "155,673"
## total_fatal total_injuries
## "15,145" "140,528"
And now lets calculate percentages vs total of each interesting variable when we only take into account EVTYPEs that we could NOT determine the kind of AUTHORIZED event name they belong to:
project_data%>%filter(! EVTYPE %in% evtype_authorized_translation$EVTYPE_NAME)%>% summarize(
total_damage_uncat=sum(TOTAL_DAMAGE,na.rm=T),
total_econ_percent=total_damage_uncat/total_damages_sum$total_damage*100,
crop_total_uncat=sum(CROP_DAMAGE,na.rm=T),
crop_percent=crop_total_uncat/total_damages_sum$crop_total*100,
property_total_uncat=sum(PROPERTY_DAMAGE,na.rm=T),
property_total_uncat_pct=property_total_uncat/total_damages_sum$property_total*100,
total_health_prob_uncat=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
total_health_prob_uncat_pct=total_health_prob_uncat/total_damages_sum$total_health_prob*100,
total_fatal_uncat=sum(FATALITIES,na.rm=T),
total_fatal_uncat_pct=total_fatal_uncat/total_damages_sum$total_fatal*100,
total_injuries_uncat=sum(INJURIES,na.rm=T),
total_injuries_pct=total_injuries_uncat/total_damages_sum$total_injuries*100
) ->subtotals
subtotals%>%
prettyNum(big.mark = ",")
## total_damage_uncat total_econ_percent
## "1,509,605,910" "0.5773531"
## crop_total_uncat crop_percent
## "636,421,100" "2.071915"
## property_total_uncat property_total_uncat_pct
## "873,184,810" "0.3784058"
## total_health_prob_uncat total_health_prob_uncat_pct
## "10,360" "6.654975"
## total_fatal_uncat total_fatal_uncat_pct
## "881" "5.817101"
## total_injuries_uncat total_injuries_pct
## "9,479" "6.745275"
On the economics side, this is no problem. Not AUTHORIZED event types account for about 0.6% of the total economic value, 2.07% of the crop damage, 0.3% of the property damage.
On the health side, we are still on about 7% loss. Its still a bit too much, so lets try and pull out the top EVTYPES that we have not been able to classify automatically.
project_data%>%filter(! EVTYPE %in% evtype_authorized_translation$EVTYPE_NAME)%>%
group_by(EVTYPE)%>%
summarize(
total_damage_uncat=sum(TOTAL_DAMAGE,na.rm=T),
total_econ_percent=total_damage_uncat/total_damages_sum$total_damage*100,
crop_total_uncat=sum(CROP_DAMAGE,na.rm=T),
crop_percent=crop_total_uncat/total_damages_sum$crop_total*100,
property_total_uncat=sum(PROPERTY_DAMAGE,na.rm=T),
property_total_uncat_pct=property_total_uncat/total_damages_sum$property_total*100,
total_health_prob_uncat=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
total_health_prob_uncat_pct=total_health_prob_uncat/total_damages_sum$total_health_prob*100,
total_fatal_uncat=sum(FATALITIES,na.rm=T),
total_fatal_uncat_pct=total_fatal_uncat/total_damages_sum$total_fatal*100,
total_injuries_uncat=sum(INJURIES,na.rm=T),
total_injuries_pct=total_injuries_uncat/total_damages_sum$total_injuries*100
)%>%
mutate(dmg_pct_total=total_damage_uncat/sum(total_damage_uncat)*100,
cum_dmg=cumsum(dmg_pct_total),
hlth_pct_total=total_health_prob_uncat/sum(total_health_prob_uncat)*100,
cum_hlth=cumsum(hlth_pct_total))%>%
select(EVTYPE,dmg_pct_total,cum_dmg,hlth_pct_total,cum_hlth)%>%
arrange(desc(hlth_pct_total,dmg_pct_total))->uncat_events_values
uncat_events_values
This is the cummulative sum of the percentage of damage captured by event type, ordered by the sum of health percentages. As it is immediatly apparent, the top 10 names ordered by percentage of health data captured will capture 100% of the value for both health and economic data.
How will we recategorize these names? There are obvious missmatches that we can treat by hand:
| EVTYPE | AUTHORIZED |
|---|---|
| TSTM | Thunderstorm Wind |
| FOG | Dense Fog |
| WILD/FOREST FIRE | Wildfire |
| GLAZE | Extreme Cold/Wind Chill |
| WILD FIRES | Wildfire |
| ICE | Extreme Cold/Wind Chill |
| WIND | High Winds |
| URBAN/SML STREAM FLD | Flood |
| LANDSLIDE | as is |
| WINTRY MIX | as is |
Its straightforward to do this in code. We just need to construct a table that can we can bind_rows to evtype_authorized_translation:
tibble(
EVTYPE_NAME=c("TSTM WIND",
"FOG",
"WILD/FOREST FIRE",
"GLAZE",
"WILD FIRES",
"ICE",
"WIND",
"URBAN/SML STREAM FLD",
"LANDSLIDE",
"WINTRY MIX"),
AUTHORIZED=c("Thunderstorm Wind",
"Dense Fog",
"Wildfire",
"Extreme Cold/Wind Chill",
"Wildfire",
"Extreme Cold/Wind Chill",
"High Winds",
"Flood",
"Landslide",
"Wintry Mix")
)%>%mutate(UPPER_AUTHORIZED=toupper(AUTHORIZED),
MATCH_TYPE="By hand: unmatched")%>%
#Make sure to preserve the order as need in the translation table
select(EVTYPE_NAME,UPPER_AUTHORIZED,AUTHORIZED,MATCH_TYPE)->final_unmatched
#Bind this new translation elements to the authorized_translation table
bind_rows(evtype_authorized_translation,final_unmatched)->evtype_authorized_translation
All the data is now cleaned, and we can freely group damages by any event type that groups significant values.
We will now summarize those values per AUTHORIZED, which is the column where the authorized event type is:
#remember that total_damages_sum has the sum totals of every variable we care about
project_data%>%
filter(TOTAL_DAMAGE>0 | TOTAL_HEALTH_PROBLEMS>0)%>%
mutate(AUTHORIZED=str_replace_all(AUTHORIZED,"Compound.*","Compound"))%>%
group_by(AUTHORIZED)%>%
summarize(
econ_damage_total=sum(TOTAL_DAMAGE,na.rm=T),
econ_damage_pct=(econ_damage_total/total_damages_sum$total_damage)*100,
econ_crop_total=sum(CROP_DAMAGE,na.rm=T),
econ_crop_pct=(econ_crop_total/econ_damage_total)*100,
econ_property_total=sum(PROPERTY_DAMAGE,na.rm=T),
econ_property_pct=(econ_property_total/econ_damage_total)*100,
health_incidents_total=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
health_incidents_pct=(health_incidents_total/total_damages_sum$total_health_prob)*100,
health_fatalities_total=sum(FATALITIES,na.rm=T),
health_fatalities_pct=(health_fatalities_total/health_incidents_total)*100,
health_injuries_total=sum(INJURIES,na.rm=T),
health_injuries_pct=(health_injuries_total/health_incidents_total)*100
)%>%
mutate_if(is.character,as_factor)->total_damages_global_percentages
This new summarized table is rather important as it encodes all our variables grouped by type and calculates the percentage contribution to each total for each type in each variable.
With it, we can extract the most valuable groups for each of our variables. Remember, we care about two results: economic damage and health problems. And of those, we have several variables crop, property, injuries and fatalities, with two type of values, which are total and pct
total_damages_global_percentages%>%ungroup()%>%
#Only percentages
gather("type",value,contains("pct"),contains("total"))%>%
separate(type,c("result","variable","type_val"),sep="_")%>%
mutate_if(is.character,forcats::as_factor)%>%
na.omit()%>%
group_by(AUTHORIZED,result,type_val,variable)%>%arrange(value)->total_damages_gathered
# View(total_damages_gathered)
Now that we have all our economic data summarized by event type and classified by type of damage (total, crop and property), we can get a picture of how that is distributed for the economic variables.
econ_facet_names<-c("crop"="Crop damage per event",
"property"="Property damage per event",
"damage" = "Total damage per event" )
total_damages_gathered%>%ungroup()%>%
filter(result=="econ" &
str_detect(type_val,"total")&
str_detect(variable,"crop|property|damage"))%>%
na.omit()->total_econ_to_plot
total_econ_to_plot%>%
ggplot(aes(x=fct_reorder(AUTHORIZED,value,.desc=FALSE),
y=value,
fill=value,
position="fill"
)
)+
geom_col()+
coord_flip()+
facet_wrap(.~variable,
scales = "free",
labeller = as_labeller(econ_facet_names))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
A thing to notice here is the scale of dollar value. Crops are on the order of 1 with ten zeroes after at their worst, but property and damage scales go for an order of magnitude more. Having that in mind, we can conclude here that:
health_facet_names<-c("fatalities"="Fatalities per event",
"injuries"="Injuries per event",
"incidents" = "Total health incidents per event" )
total_damages_gathered%>%ungroup()%>%
filter(result=="health" &
str_detect(type_val,"total")&
str_detect(variable,"injuries|fatalities|incidents"))%>%
na.omit()->total_health_to_plot
total_health_to_plot %>%
group_by(variable)%>%
mutate(order_within=row_number())%>%
arrange(order_within,value)%>%ungroup()->total_health_to_plot
total_health_to_plot%>%
filter(value!=0)%>%
ggplot(aes(x=fct_reorder(AUTHORIZED,value,.desc=FALSE),
y=value,
fill=value,
position="fill"
)
)+
geom_col()+
coord_flip()+
facet_wrap(.~variable,
scales = "free",
labeller = as_labeller(health_facet_names))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Conversely, we can see here fatalities and injuries are also in different scales for value. However, it is easy to see that even it differs how much of one or the other do we have per event, most of them are the same events.
Worst events in general The top 15 events in the total damages column is mostly the same, although with different ranking if you view them by injuries or fatalities.
Top killers: Avalanches, high surf, Cold/Wind chill, landslide, storm surge/tide are killers that dont seem to conrrespond in the injuries.
Injurers: The top 15 events shown are the top injurers.
As a final view of all damages across the U.S., we now present the total contribution of each state, by event type and type of damage, as a percentage of total damages summarized as economic and health damages, to see how they all rank by that meassure.
project_data %>% group_by(AUTHORIZED,STATE) %>% summarize(INJURIES=sum(INJURIES,na.rm=TRUE),
FATALITIES=sum(FATALITIES,na.rm=TRUE),
TOTAL_HEALTH_PROBLEMS=INJURIES+FATALITIES,
CROP_DAMAGE=sum(CROP_DAMAGE),
PROPERTY_DAMAGE=sum(PROPERTY_DAMAGE),
TOTAL_DAMAGE=CROP_DAMAGE+PROPERTY_DAMAGE,
TOTAL_ALL=TOTAL_HEALTH_PROBLEMS+TOTAL_DAMAGE) %>%
filter(TOTAL_ALL>0)%>%ungroup() %>% mutate(PCT_INJ=percent_rank(INJURIES),P_FAT=percent_rank(FATALITIES),
PCT_PROPERTY=percent_rank(PROPERTY_DAMAGE),
PCT_CROP=percent_rank(CROP_DAMAGE),
PCT_HTOT=percent_rank(TOTAL_HEALTH_PROBLEMS),
PCT_TOTDMG=percent_rank(TOTAL_DAMAGE),
STATE=as_factor(STATE),
AUTHORIZED=as_factor(AUTHORIZED)) ->by_state_contribution
by_state_contribution %>%
select(AUTHORIZED,STATE,matches("PCT_.*"))%>%
gather("type_dmg","value",matches("PCT_.*")) %>%
filter(value>0)%>%
mutate(type_dmg=as_factor(type_dmg))-> percentages_by_state_and_event
percentages_by_state_and_event %>%
spread(key = type_dmg,value = value)%>%mutate_at(vars(matches("PCT_*")), ~ifelse(is.na(.),0,.))->percentages_spread_and_cleaned
percentages_spread_and_cleaned%>%
mutate(AUTHORIZED=stringr::str_replace_all(AUTHORIZED,"Compound:","Cpd:"))%>%
ggplot(aes(PCT_TOTDMG,
PCT_HTOT,
color=STATE))+geom_point()+facet_wrap(AUTHORIZED~.)+
geom_label(label=percentages_spread_and_cleaned$STATE,nudge_x = 0.1,nudge_y = 0.15)+
ggtitle("Health problems vs. Economic Damage",subtitle = "Contributing States Per Event Type")+
xlab("Percentage of compounded (crop+property) economic damage")+
ylab("Percentage of compounded (injuries+fatalities) health damage")+
theme(legend.position="none")
So the more to the right a state label appears, the more percentage of economic damage it contributed to the total damage for that meteorological event type. The more to the top, the more health damage happened in the given state for that event type.
With this final graph and the other two supporting plots, a policy maker might rank event types as the most dangerous by injury or fatality, the most economically onerous by property or crop and, finally, where (which states), have suffered the most of the compound health (injuries+deaths) and economic (property+crops) damage.