Analysis of Storm Data

Synopsis

Extreme weather events like storms and hurricanes can have severe consequences on both population health and the economy. One key concern of the administration is therefore to prevent negative consequences of such events. The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database collects the impact of such weather events on U.S. soil. Analysis of this database allows the administration to decide how and where to best spend available resources.

This report explores this storm database and answers the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

This report shows that tornados are most harmful for population health, while floods result in the largest economic loss.

Data Processing

First, load some libraries we are going to use in the rest of the analysis

require(readr)
require(dplyr)
require(tidyr)
require(knitr)
require(lubridate)
require(ggplot2)
require(kableExtra)

Download Storm Data File

We are now downloading the NOAA storm database and read it into an R dataframe:

URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2";
FN <- "StormData.csv.bz2";
download.file(URL,destfile = FN);
df <- read_csv(FN);
df_orig <- df;

Data Preprocessing

Pruning data

Many of the rows are not interesting for us as they have neither human nor economic consequences. We prune the corresponding rows.

  df <- df %>% 
    filter(FATALITIES > 0 | INJURIES > 0 | CROPDMG > 0 | PROPDMG > 0 )

Before this pruning, there were 902297 rows. After the pruning, there are 254633, a reduction by 71.78%. Some of the rows appear to be duplicates. There are many rows with identical content except for the REFNUM column. We will eliminate these duplicates.

  dupesWhenIgnoringRefnum <- df %>% 
    select(-REFNUM) %>% 
    duplicated();
  
  df <- df %>% 
    mutate(dupesWhenIgnoringRefnum=dupesWhenIgnoringRefnum);
  
  df_WithDuplicateInformation <- df;

  # after this operation, df does not contain duplicates due to REFNUM
  df <- df_WithDuplicateInformation %>%
    filter(dupesWhenIgnoringRefnum==FALSE) %>%
    select(-dupesWhenIgnoringRefnum)

We still have 253894 rows, but we achieved a further reduction by 0.29%. There are probably far more duplicates. For example, some rows are identical except for some field being NA in one row, while the duplicate candidate has some value in the same field. There are also some duplicate candidates where the only difference is some different spelling in the REMARKS field. However, we can not generalize this finding - sometimes, there is more than just a spelling difference, sometimes, there is a different BGN_LOCATI and so on. As we do not know exactly the origin of the duplicates, we will not continue to further prune the dataset due to duplicates.

Some more pruning, this time for the columns:

    # select only columns that we need
    df <- df %>% select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,
                        EVTYPE,
                        FATALITIES,INJURIES,
                        CROPDMG,CROPDMGEXP,
                        PROPDMG,PROPDMGEXP,
                        REFNUM,REMARKS)

    # We will only work with year and month of the event
    df <- df %>% mutate(BGN_YEAR=year(mdy_hms(BGN_DATE)),
                        BGN_MONTH=month(mdy_hms(BGN_DATE)))

Preparation for the Calculation of Economic Costs

The data contains fields describing economic (CROPDMG,PROPDMG) and human (FATALITIES,INJURIES) consequences. The values for CROPDMG and PROPDMG are associated with a multiplier (CROPDMGEXP, PROPDMGEXP). The associated information (section 2.7) only describes K (thousands) M (Millions) and B (Billions) as possible multipliers. However, insspecting the dataset reveals that far more mulitpliers have been used.

strHere <- "CROPDMGEXP";   
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")

## [1] "CROPDMGEXP :  NA M K m B ? 0 k"

strHere <- "PROPDMGEXP";   
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")

## [1] "PROPDMGEXP :  K M NA B m + 0 5 6 4 h 2 7 3 H -"

In the following, we will assume that lowercase letters should be uppercase:

strHere <- "CROPDMGEXP";   
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")

## [1] "CROPDMGEXP :  NA M K m B ? 0 k"

strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")

## [1] "PROPDMGEXP :  K M NA B m + 0 5 6 4 h 2 7 3 H -"

# cast lowercase letters to uppercase letters
df$CROPDMGEXP <- toupper(df$CROPDMGEXP);
df$PROPDMGEXP <- toupper(df$PROPDMGEXP);

strHere <- "CROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")

## [1] "CROPDMGEXP :  NA M K B ? 0"

strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")

## [1] "PROPDMGEXP :  K M NA B + 0 5 6 4 H 2 7 3 -"

There are still a number of multipliers that we do not understand. One way to fix this problem is to go to the NOAA website and download interactively pieces of data that correspond to our missing multipliers.

require(dplyr)
require(knitr)

# CROPDMGEXP is NA
ddf <- df[is.na(df$CROPDMGEXP) & df$CROPDMG > 0,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMG	CROPDMGEXP	PROPDMG	PROPDMGEXP
7/4/1994 0:00:00	0400	ND	STUTSMAN	3	NA	5	K
4/5/1994 0:00:00	1700	TX	HAYS	4	NA	5	M
4/15/1994 0:00:00	1630	TX	MEDINA	4	NA	500	K

We download the corresponding data and find:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	Dth	Inj	PrD	CrD
Jamestown	STUTSMAN CO.	ND	07/04/1994	04:00	CST	Hail	1.75 in.	0	0	5.00K	0.00K
Jamestown	STUTSMAN CO.	ND	07/04/1994	04:55	CST	Hail	1.25 in.	0	0	5.00K	5.00K

We conclude that there was no crop damage for this event. Possibly, all NA values indicate that the corresponding Crop Damage should be zero? Downloading the data for the Texas case confirms that:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	Dth	Inj	PrD	CrD
San Marcos	HAYS CO.	TX	04/05/1994	17:00	CST	Thunderstorm Wind	52 kts.	0	0	5.000M	0.00K

We therefore conclude that cases where is.na(CROPDMGEXP) & CROPDMG>0 is TRUE is in fact a multiplier of 0.

Now we are going to look at the ? cases:

# CROPDMGEXP is ?
selector <- which(df$CROPDMGEXP == "?" & df$CROPDMG >0);
length(selector);

## [1] 0

We see that there are no cases with CROPDMGEXP=="?" & CROPDMG >0 as the length of this selector is 0.

The last case for crops is the 0. Let’s have a look at that:

# CROPDMGEXP=="0"
selector <- which(df$CROPDMGEXP == "0" & df$CROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMG	PROPDMG	PROPDMGEXP
7/17/1995 0:00:00	0258	GA	CAMDEN	20	0	NA
10/6/1994 0:00:00	2105	IA	BOONE	5	5	K
5/27/1995 0:00:00	1901	IA	CRAWFORD	50	100	K
10/6/1994 0:00:00	1630	IA	HANCOCK	25	5	K
10/6/1994 0:00:00	2225	IA	MITCHELL	25	25	K
5/27/1995 0:00:00	2119	IA	RINGGOLD	60	30	K
10/6/1994 0:00:00	2038	IA	WEBSTER	5	2	K
10/6/1994 0:00:00	2039	IA	WEBSTER	5	5	K
5/27/1995 0:00:00	2057	IA	WEBSTER	50	40	K
10/6/1994 0:00:00	2123	IA	WINNEBAGO	5	5	K
10/6/1994 0:00:00	2135	IA	WINNEBAGO	5	2	K
10/6/1994 0:00:00	2101	IA	WRIGHT	5	2	K

So there are 12 cases for this case. Let’s have a look at two cases:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	Dth	Inj	PrD	CrD
Spring Bluff	CAMDEN CO.	GA	07/17/1995	02:58	EST	Hail	0.75 in.	0	0	0.00K	0.20K
Mt Ayr to	RINGGOLD CO.	IA	05/27/1995	21:19	CST	Tornado	F1	0	0	30.00K	0.60K

These tow cases suggest that CROPDMGEXP=="0" corresponds to a multiplier of 10. Due to temporal constraints we are not investigating any further and get the following table for a mapping between CROPDMGEXP and multipliers:

EXP <- c("K","M","B","0","?","NA");
Multiplier <- c(1000,1e6,1e9,10,0,0);
Type <- rep("CROP",6);
mappingTableCrop <- data.frame(EXP=EXP, Multiplier=Multiplier, Type=Type);
kable(mappingTableCrop)

EXP	Multiplier	Type
K	1e+03	CROP
M	1e+06	CROP
B	1e+09	CROP
0	1e+01	CROP
?	0e+00	CROP
NA	0e+00	CROP

Let’s do the same thing for the Property damages. There are quite a bit more. We start with NA.

# PROPDMGEXP is "NA"
selector <- which(is.na(df$PROPDMGEXP) & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(head(ddf))

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMGEXP	PROPDMG	PROPDMGEXP
3/9/1995 0:00:00	0301	CA	MARIN	?	0.41	NA
1/8/1993 0:00:00	1130	FL	UNION	NA	3.00	NA
8/19/1995 0:00:00	1700	GA	DECATUR	NA	2.00	NA
3/31/1993 0:00:00	2015	GA	HART	NA	4.00	NA
5/12/1993 0:00:00	1630	IN	CASS	NA	4.00	NA
5/14/1995 0:00:00	0300	IN	VANDERBURGH	NA	10.00	NA

There are 76 cases, but we only show the first 6 ones. A sample from the downloadable Data yields the following:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	Dth	Inj	PrD	CrD
Gainesville	UNION CO.	FL	01/08/1993	11:30	EST	Tornado	F0	0	0	0.00K	0.00K
CASS CO.	CASS CO.	IN	05/12/1993	16:30	EST	Thunderstorm Wind	0 kts.	0	0	0.00K	0.00K

We therefore conclude that NA values for PROPDMGEXP corresponds to a multiplier of 0.

We continue with H (we suspect this corresponds to 100):

# PROPDMGEXP == "H"
selector <- which(df$PROPDMGEXP=="H" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMGEXP	PROPDMG	PROPDMGEXP
9/16/1994 0:00:00	1630	MI	CLINTON	NA	2	H
7/14/1995 0:00:00	1923	NE	SHERMAN	NA	5	H
1/14/1995 0:00:00	1050	SC	LAURENS	NA	5	H
3/8/1995 0:00:00	1305	SC	LAURENS	NA	2	H
1/19/1995 0:00:00	1815	SC	PICKENS	NA	3	H
7/12/1995 0:00:00	1300	WY	BIG HORN	NA	5	H
7/15/1995 0:00:00	1818	WY	CAMPBELL	NA	5	H

There are 7 cases in total. A sample from the downloadable Data yields the following:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	Dth	Inj	PrD	CrD
St. Johns	CLINTON CO.	MI	09/16/1994	16:30	EST	Thunderstorm Wind	50 kts.	0	0	0.20K	0.00K
BIG HORN CO.	BIG HORN CO.	WY	07/12/1995	13:00	MST	Thunderstorm Wind	54 kts.	0	0	0.50K	0.00K

The data confirm our suspicion for the two samples. We therefore conclude that H values for PROPDMGEXP corresponds to a multiplier of 100.

We continue with -:

# PROPDMGEXP == "-"
selector <- which(df$PROPDMGEXP=="-" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMG	CROPDMGEXP	PROPDMG	PROPDMGEXP
12/12/1995 0:00:00	1000	OR	ORZ004 - 05 - 06 - 08 - 09	0	NA	15	-

There is only 1 case in total. The row seems to be corrupted as there is no proper county name. A search in the downloadable Data yields no hits for this date and Oregon. We therefore conclude that - corresponds to a multiplier of 0.

We continue with +:

# PROPDMGEXP == "+"
selector <- which(df$PROPDMGEXP=="+" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMGEXP	PROPDMG	PROPDMGEXP
5/1/1995 0:00:00	0000	AK	AKZ001	NA	20	+
12/1/1994 0:00:00	0000	AK	AKZ024 - 006 - 005	NA	20	+
3/9/1995 0:00:00	1000	CA	CAZ001	NA	2	+
6/16/1995 0:00:00	0330	NV	NVZ001 - 002 - 003 - 004 - 005	NA	15	+
6/5/1995 0:00:00	1304	NV	NVZ003 - 004 - 007	NA	60	+

There are 5 cases in total. Again, the rows seem corrupted due to strange county names. We checked row 1 and 3 on the online NOAA database and did not find any hits corresponding to the entries in the table above. We therefore conclude that - corresponds to a multiplier of 0.

We continue with 0 (we suspect this corresponds to 10 as in the case of crops):

# PROPDMGEXP == "0"
selector <- which(df$PROPDMGEXP=="0" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(head(ddf,n=30))

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMGEXP	PROPDMG
12/3/1994 0:00:00	0000	AK	AKZ024 - 006 - 005	NA	150.0
8/10/1994 0:00:00	2100	CO	JEFFERSON	NA	50.0
8/10/1994 0:00:00	2130	CO	JEFFERSON	NA	5.0
8/11/1994 0:00:00	0030	CO	JEFFERSON	NA	37.0
8/12/1994 0:00:00	1300	CA	RIVERSIDE	NA	0.5
8/13/1994 0:00:00	1300	CA	RIVERSIDE	NA	0.5
8/10/1994 0:00:00	2100	CO	BOULDER	NA	1.0
8/10/1994 0:00:00	2120	CO	BOULDER	NA	1.0
8/1/1994 0:00:00	1415	CO	CLEAR CREEK	NA	50.0
8/13/1994 0:00:00	1345	CO	GILPIN	NA	1.5
5/2/1995 0:00:00	1820	CO	JEFFERSON	NA	20.0
8/10/1994 0:00:00	2100	CO	LARIMER	NA	50.0
8/19/1994 0:00:00	1700	CO	MESA	NA	5.0
8/19/1994 0:00:00	2030	CO	MESA	NA	50.0
8/12/1994 0:00:00	1630	CO	MORGAN	NA	2.0
8/4/1994 0:00:00	1615	CO	PARK	NA	4.0
8/1/1994 0:00:00	1700	CO	PHILLIPS	NA	50.0
8/13/1994 0:00:00	1415	CO	PUEBLO	NA	1.0
9/4/1994 0:00:00	0000	CO	PUEBLO	NA	25.0
5/2/1995 0:00:00	1510	CO	WELD	NA	1.0
4/4/1995 0:00:00	1322	CT	FAIRFIELD	NA	25.0
6/20/1995 0:00:00	1520	CT	HARTFORD	NA	20.0
6/20/1995 0:00:00	1535	CT	HARTFORD	NA	10.0
6/20/1995 0:00:00	1630	CT	MIDDLESEX	NA	0.7
4/4/1995 0:00:00	1320	CT	NEW HAVEN	NA	25.0
5/11/1995 0:00:00	1145	FL	COLUMBIA	NA	15.0
2/4/1995 0:00:00	0600	FL	HILLSBOROUGH	NA	10.0
2/13/1995 0:00:00	2048	FL	INDIAN RIVER	NA	50.0
2/3/1995 0:00:00	2210	FL	MARION	NA	25.0
2/13/1995 0:00:00	2115	FL	ST. LUCIE	NA	50.0

There are 209 cases in total, displayed are 30. The first row seems to be corrupted. A sample from the downloadable Data yields the following:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	Dth	Inj	PrD	CrD
Valle Vista	RIVERSIDE CO.	CA	08/12/1994	13:00	PST	Tornado	F0	0	0	0.00K	0.00K
Shelton	FAIRFIELD CO.	CT	04/04/1995	13:22	EST	Thunderstorm	Wind	0 kts.	0	0	0.25K

There is no clear-cut conclusion possible. The sample from Connecticut indicates a multiplier of 10, the sample from California a multiplier of 0. We tried to find several other rows from the dataset with no hits in the NOAA online dataset. We finally conclude, mostly for consistency reasons with the crops, that 0 values for PROPDMGEXP corresponds to a multiplier of 10.

We continue with cases larger than 0:

# PROPDMGEXP == 0..10
selector <- which(df$PROPDMGEXP %in% as.character(c(1:10)) & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)

BGN_DATE	BGN_TIME	STATE	COUNTYNAME	CROPDMG	CROPDMGEXP	PROPDMG	PROPDMGEXP
2/8/1995 0:00:00	1800	CO	COZ003>005 - 009 - 010 - 012 - 013 - 015 - 017 - 018 - 023 - 033>036 - 038>040 - 060 - 061 - 065	0.0	NA	1.7	5
6/7/1995 0:00:00	1905	CO	WELD	25.0	M	30.0	5
4/4/1995 0:00:00	1400	CT	MIDDLESEX	0.0	NA	0.1	5
5/27/1995 0:00:00	1620	IL	CLINTON	0.0	NA	24.0	6
6/20/1995 0:00:00	1830	IL	CLINTON	0.0	NA	10.0	5
5/27/1995 0:00:00	1715	IL	GREENE	0.0	NA	14.0	5
6/20/1995 0:00:00	1900	IL	MACOUPIN	0.0	NA	12.0	5
6/8/1995 0:00:00	0612	IL	MADISON	0.0	NA	26.0	6
5/18/1995 0:00:00	1255	IL	MARION	0.0	NA	15.0	6
5/18/1995 0:00:00	1137	IL	MONROE	0.0	NA	88.0	5
5/18/1995 0:00:00	1140	IL	MONROE	0.0	NA	12.0	5
5/27/1995 0:00:00	1717	IL	MONTGOMERY	0.0	NA	10.0	4
9/7/1994 0:00:00	0115	KS	DECATUR	0.0	NA	3.0	5
7/28/1995 0:00:00	0145	MA	BRISTOL	0.0	NA	0.2	5
6/8/1995 0:00:00	0459	MO	CALLAWAY	0.0	NA	12.0	2
6/7/1995 0:00:00	1210	MO	FRANKLIN	0.0	NA	14.0	7
5/16/1995 0:00:00	1750	MO	SHELBY	2.5	K	20.0	3
5/16/1995 0:00:00	0510	MO	ST. LOUIS	0.0	NA	12.0	5
5/27/1995 0:00:00	1600	NE	NEZ006>010	0.0	NA	2.2	4
5/15/1995 0:00:00	1730	NC	NEW HANOVER	0.0	NA	1.0	5
7/4/1995 0:00:00	1930	NC	NORTHAMPTON	0.0	NA	68.0	7
6/21/1995 0:00:00	2030	PA	BEAVER	0.0	NA	1.0	5
8/18/1994 0:00:00	1500	PA	LYCOMING	0.0	NA	2.2	4
4/13/1995 0:00:00	0000	SD	SDZ004>008 - 010 - 011 - 016>018 - 034>037 - 051	0.0	NA	5.2	5
5/17/1995 0:00:00	0055	TX	PARMER	0.0	NA	0.2	5
3/11/1995 0:00:00	1118	UT	WASHINGTON	0.0	NA	0.7	5
3/11/1995 0:00:00	1625	UT	WASHINGTON	0.0	NA	0.1	4
7/27/1995 0:00:00	2020	VA	HENRICO	0.0	NA	13.0	5
6/22/1995 0:00:00	1950	VA	ROCKBRIDGE	430.0	K	6.4	5

There are 29 cases in total, all are displayed. Some rows, again, seem to be corrupted. A sample from the downloadable Data yields the following:

Location	County/Zone	St.	Date	Time	T.Z.	Type	Mag	PrD	CrD
Gilcrest	WELD CO.	CO	06/07/1995	19:05	MST	Hail	0.50 in.	0.30K	25.000M
St.Rose	CLINTON CO.	IL	05/27/1995	16:20	CST	Thunderstorm Wind	0 kts.	0.25K	0.00K
Alton	MADISON CO.	IL	06/08/1995	06:12	CST	Thunderstorm Wind	0 kts.	0.27K	0.00K
Shamrock	CALLAWAY CO.	MO	06/08/1995	04:59	CST	Thunderstorm Wind	0 kts.	0.12K	0.00K

There is no clear-cut conclusion possible. The multipliers can take on different values between 0 and 10.4166667. We tried to find several other rows from the dataset with no hits in the NOAA online dataset. We finally conclude, mostly for consistency reasons with the crops, that any numeric values for PROPDMGEXP corresponds to a multiplier of 10.

This therefore yields a mapping table as follows:

EXP <- c("K","M","B","NA","+","-","?","0","1","2","3","4","5","6","7","8","9");
Multiplier <- c(1000,1e6,1e9,0,0,0,0,10,10,10,10,10,10,10,10,10,10);
Type <- rep("PROP",17);
mappingTableProp <- data.frame(EXP=EXP, Multiplier=Multiplier, Type=Type);
mappingTable <- rbind(mappingTableCrop,mappingTableProp);
kable(mappingTable)

EXP	Multiplier	Type
K	1e+03	CROP
M	1e+06	CROP
B	1e+09	CROP
0	1e+01	CROP
?	0e+00	CROP
NA	0e+00	CROP
K	1e+03	PROP
M	1e+06	PROP
B	1e+09	PROP
NA	0e+00	PROP
+	0e+00	PROP
-	0e+00	PROP
?	0e+00	PROP
0	1e+01	PROP
1	1e+01	PROP
2	1e+01	PROP
3	1e+01	PROP
4	1e+01	PROP
5	1e+01	PROP
6	1e+01	PROP
7	1e+01	PROP
8	1e+01	PROP
9	1e+01	PROP

In the last step, we calculate the real economic costs using the values in the ...EXP field and the multipliers in the mappingTable.

To achieve this goal, we loop over the rows of mappingTable and calculate a total column for both crop and prop damage using a custom function calcTotalDamage:

    df <- df %>% 
        left_join(select(mappingTableCrop,-Type),by=c("CROPDMGEXP"="EXP")) %>% 
        mutate(CROP_Multiplier=Multiplier) %>%
        select(-Multiplier) %>%
        left_join(select(mappingTableProp,-Type),by=c("PROPDMGEXP"="EXP")) %>% 
        mutate(PROP_Multiplier=Multiplier) %>%
        select(-Multiplier) %>%
        mutate(CROPDMG_TOTAL=CROPDMG*CROP_Multiplier) %>%
        mutate(PROPDMG_TOTAL=PROPDMG*PROP_Multiplier) 
    
    # sometimes, the total values are NA because not all rows contain valid PROP and CROP damages
    # some contain only PROP and no CROP (-->NA) or only CROP and no PROP (-->NA) values
    # we set the NA values to 0 to allow for ease of use
    df$CROPDMG_TOTAL[is.na(df$CROPDMG_TOTAL)] <- 0;
    df$PROPDMG_TOTAL[is.na(df$PROPDMG_TOTAL)] <- 0;

Some more row pruning and summarising after this total damage calculation:

    df <- df %>% select(BGN_YEAR,BGN_MONTH,EVTYPE,
                         FATALITIES,INJURIES,
                         CROPDMG_TOTAL,PROPDMG_TOTAL)

    # now group by year, month, and EVTYPE
    df <- df %>% group_by (BGN_YEAR,
                           BGN_MONTH,
                           EVTYPE) %>%
        summarise_all(funs(sum)) %>%
        ungroup
    # finally, what is the size of the resulting dataset?
    dim(df)

## [1] 5870    7

So we now end up with 5870 rows and 7 cols. Finally, a small dataset.

OK, so what is the togal damage to population health and the economy?

    # let's summarize the total damage to population health and economy:
    summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
                     INJURIES_SUM=sum(INJURIES),
                     CROPDMG_SUM=sum(CROPDMG_TOTAL),
                     PROPDMG_SUM=sum(PROPDMG_TOTAL))
    kable(summary_table)

FATALITIES_SUM	INJURIES_SUM	CROPDMG_SUM	PROPDMG_SUM
15081	139710	49028865010	425730811103

Apparently, we are talking billions here. So let’s reduce those numbers to something humanly readable

    df <- df %>% mutate(CROPDMG_TOTAL=CROPDMG_TOTAL/1000000000) %>%
    mutate(PROPDMG_TOTAL=PROPDMG_TOTAL/1000000000)

That should result into something more readable:

    # let's summarize the total damage to population health and economy:
    summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
                     INJURIES_SUM=sum(INJURIES),
                     CROPDMG_SUM=sum(CROPDMG_TOTAL),
                     PROPDMG_SUM=sum(PROPDMG_TOTAL))
    kable(summary_table)

FATALITIES_SUM	INJURIES_SUM	CROPDMG_SUM	PROPDMG_SUM
15081	139710	49.02887	425.7308

The economic costs are given without inflation. At least, in the storm data pdf documentation, there is no mention of inflation. We are going to adjust for inflation by using the US Consumer Price Index and Inflation dataset.

    # inflation-adjustment
    download.file("https://datahub.io/core/cpi-us/r/cpiai.csv",destfile = "cpi.csv")
    cpi <- read_csv("cpi.csv")
    cpi <- cpi %>% mutate(year=year(ymd(Date)),month=month(ymd(Date)))
    df <- df %>% left_join(cpi,by=c("BGN_YEAR"="year","BGN_MONTH"="month"))
    df<- df %>% 
        mutate(PROPDMG_INFLADJUSTED_TOTAL=PROPDMG_TOTAL/Index) %>% 
        mutate(CROPDMG_INFLADJUSTED_TOTAL=CROPDMG_TOTAL/Index)
    # scale to dollars as of the end of the dataset
    df <- df %>% 
        mutate(PROPDMG_INFLADJUSTED_TOTAL = PROPDMG_INFLADJUSTED_TOTAL * df$Index[nrow(df)]) %>%
        mutate(CROPDMG_INFLADJUSTED_TOTAL = CROPDMG_INFLADJUSTED_TOTAL * df$Index[nrow(df)])
        # replace the TOTALs by the inflation adjusted values
    df <- df %>%
        mutate(PROPDMG_TOTAL=PROPDMG_INFLADJUSTED_TOTAL) %>%
        mutate(CROPDMG_TOTAL=CROPDMG_INFLADJUSTED_TOTAL) %>%
        select(-PROPDMG_INFLADJUSTED_TOTAL,-CROPDMG_INFLADJUSTED_TOTAL, -Index, -Inflation)
    
    
    summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
                                      INJURIES_SUM=sum(INJURIES),
                                      CROPDMG_SUM=sum(CROPDMG_TOTAL),
                                      PROPDMG_SUM=sum(PROPDMG_TOTAL))
    
    kable(summary_table)

FATALITIES_SUM	INJURIES_SUM	CROPDMG_SUM	PROPDMG_SUM
15081	139710	64.85206	583.4175

Finally, we are going to drop the month information as we are not going to use it in the following:

    # now group by year and EVTYPE
    df <- df %>% group_by (BGN_YEAR,
                           EVTYPE) %>%
        summarise_all(funs(sum)) %>%
        ungroup %>%
        select(-BGN_MONTH)

We are finishing this section with a dataset of dimensions 1374x7 and 485 different event types.

Event Categorization

The National Wheather Service Documentation (section 2.1.1) lists 48 event types. We saved the list of these event types to the txt file Eventtypes.csv. The data in the main dataframe df, however, contains 485 different events. Our aim now is to map the event types existing in the dataset to those that are defined in the documentation. First, we are going to set all strings to uppercase and repair some obvious misspellings/abbreviations/semantic mappings.

    d <- df$EVTYPE;
    d <- toupper(d);
    
    d <- gsub("TSTM|T.?U.?DERSTORM.?|THUNDERESTORM.?|THUNDEERSTORM.?|THUNERSTORM.?","THUNDERSTORM",d,ignore.case=TRUE);
    d <- gsub("LIGHTING|LIGNTNING","LIGHTNING",d,ignore.case=T);
    d <- gsub("TORNDAO","TORNADO",d,ignore.case=TRUE);
    d <- gsub("WINDS|WINS","WIND",d,ignore.case=TRUE);
    d <- gsub("SHOWER","RAIN",d,ignore.case=TRUE);
    d <- gsub("WINTRY","WINTER",d,ignore.case=TRUE);
    d <- gsub("Low temperature","COLD",d,ignore.case=TRUE);
    d <- gsub("FLD|FLOODING|FLOODS","FLOOD",d,ignore.case=TRUE);
    d <- gsub("AVALANCE","AVALANCHE",d,ignore.case=TRUE);
    d <- gsub("  "," ",d,ignore.case=TRUE);
    
    df$EVTYPE <- d;

Good. But we still have 404 event types. One way to map these event types to the 48 given by the documentation is to select the closest possible event type from the documentation to any given event type of the data. We can do that using Rs adist function which calculates the Levenshtein Distance between strings. We then choose the the documentation event type with the lowest distance to the data event type. In case of ties, we simply take the first occurence. At the end, we create a mapping table between data event types and documentation eventtypes.

    strEVTYPES <- read_csv("Eventtypes.csv",col_names=FALSE)$X1
    
    d <- unique(d) %>% sort();
    a <- adist(d,strEVTYPES,ignore.case = T) %>%
        apply(1,which.min) %>%
        sapply(function(x){strEVTYPES[x]})
    
    c <- data.frame(EventTypesInData=d,EventTypesInDoc=a)
    kable(c  %>% arrange(EventTypesInData) %>% head(30))

EventTypesInData	EventTypesInDoc
?	Hail
AGRICULTURAL FREEZE	Frost/Freeze
APACHE COUNTY	Waterspout
ASTRONOMICAL HIGH TIDE	Astronomical Low Tide
ASTRONOMICAL LOW TIDE	Astronomical Low Tide
AVALANCHE	Avalanche
BEACH EROSION	Heavy Rain
BLACK ICE	Avalanche
BLIZZARD	Blizzard
BLIZZARD/WINTER STORM	Winter Storm
BLOWING DUST	Blizzard
BLOWING SNOW	Heavy Snow
BREAKUP FLOOD	Flash Flood
BRUSH FIRE	Wildfire
COASTAL EROSION	Coastal Flood
COASTAL FLOOD	Coastal Flood
COASTAL FLOOD/EROSION	Coastal Flood
COASTAL STORM	Coastal Flood
COASTAL SURGE	Coastal Flood
COASTALSTORM	Coastal Flood
COLD	Flood
COLD AIR TORNADO	Tornado
COLD AND SNOW	Heavy Snow
COLD AND WET CONDITIONS	Cold/Wind Chill
COLD TEMPERATURE	Cold/Wind Chill
COLD WAVE	Wildfire
COLD WEATHER	Winter Weather
COLD/WIND	High Wind
COLD/WIND CHILL	Cold/Wind Chill
COOL AND WET	Avalanche

We see that the result is kind of mixed. Some values seem to be matched kind of ok, some are not right at all. We will therefore resort to a manual mapping of event types.

    mappingTable <- data.frame(EventTypeData=d,
                               EventTypeNew=rep(NA,length(d)));
    selector <- grepl("TORNADO",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "TORNADO"
    selector <- grepl("HURRICANE|TYPHOON",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HURRICANE"
    selector <- grepl("THUNDERSTORM",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "THUNDERSTORM"
    selector <- grepl("FLOOD|FLD|FLOODING|FLOODS",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FLOOD"
    selector <- grepl("DROUGHT",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "DROUGHT"
    selector <- grepl("HEAT|.*hyperthermia.*|unseasonably warm.*|.*warm weather.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HEAT"
    selector <- grepl("COOL|COLD|.*hypothermia.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "COLD"
    selector <- grepl("STORM",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "STORM"
    selector <- grepl("WIND|NON.*THUNDERSTORM.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HIGH WIND"
    selector <- grepl("HAIL",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HAIL"
    selector <- grepl("RAIN|PRECIP.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "RAIN"
    selector <- grepl("ICE|ICY",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "SNOW/ICE"
    selector <- grepl("SNOW",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "SNOW/ICE"
    selector <- grepl("FREEZ.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FROST/FREEZE"
    selector <- grepl("FROST.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FROST/FREEZE"
    selector <- grepl("TSUNAMI",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "TSUNAMI"
    selector <- grepl("FIRE",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FIRE"
    selector <- grepl("WINTER",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "WINTER"
    selector <- grepl("LIGHTNING",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "LIGHTNING"
    selector <- grepl("FOG",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FOG"
    selector <- grepl("BLIZZARD",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "BLIZZARD"
    selector <- grepl("AVALANCHE",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "AVALANCHE"
    selector <- grepl("rip current.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "RIP CURRENT"
    selector <- grepl("SURF",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HIGH SURF"
    selector <- grepl("LANDSLIDE",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "LANDSLIDE"
    
    kable(mappingTable  %>% arrange(EventTypeData) %>% head(30))

EventTypeData	EventTypeNew
?	NA
AGRICULTURAL FREEZE	FROST/FREEZE
APACHE COUNTY	NA
ASTRONOMICAL HIGH TIDE	NA
ASTRONOMICAL LOW TIDE	NA
AVALANCHE	AVALANCHE
BEACH EROSION	NA
BLACK ICE	SNOW/ICE
BLIZZARD	BLIZZARD
BLIZZARD/WINTER STORM	STORM
BLOWING DUST	NA
BLOWING SNOW	SNOW/ICE
BREAKUP FLOOD	FLOOD
BRUSH FIRE	FIRE
COASTAL EROSION	NA
COASTAL FLOOD	FLOOD
COASTAL FLOOD/EROSION	FLOOD
COASTAL STORM	STORM
COASTAL SURGE	NA
COASTALSTORM	STORM
COLD	COLD
COLD AIR TORNADO	TORNADO
COLD AND SNOW	COLD
COLD AND WET CONDITIONS	COLD
COLD TEMPERATURE	COLD
COLD WAVE	COLD
COLD WEATHER	COLD
COLD/WIND	COLD
COLD/WIND CHILL	COLD
COOL AND WET	COLD

That seems better. There are still 13.12% data event types unmapped. Let’s have a look what percentage on population health and economic damage those represent.

    # apply the mappingTable
    df2 <- df %>% left_join(mappingTable,
                            by=c("EVTYPE"="EventTypeData"))
    ddf <- df2 %>% mutate(chooser=is.na(EventTypeNew)) %>% group_by(chooser) %>%
        summarize(sumFAT=sum(FATALITIES),
                  sumINJ=sum(INJURIES),
                  sumDMGCROP=sum(CROPDMG_TOTAL),
                  sumDMGPROP=sum(PROPDMG_TOTAL));
    kable(ddf)

chooser	sumFAT	sumINJ	sumDMGCROP	sumDMGPROP
FALSE	15028	139362	64.6399213	583.3668594
TRUE	53	348	0.2121368	0.0506244

Now, that seems reasonable. We have 24 event types, of which one contains all the unmatched data event types, but we succeeded to have more than 99% in each damage category within our mapped 23 damage categories. We think we can continue with the resulting dataset.

    df <- df2 %>% mutate(Event_Type=EventTypeNew) %>%
        select(-EVTYPE,EventTypeNew) %>%
        group_by (BGN_YEAR, Event_Type) %>%
        summarize(FATALITIES=sum(FATALITIES),
        INJURIES=sum(INJURIES),
        CROPDMG=sum(CROPDMG_TOTAL),
        PROPDMG=sum(PROPDMG_TOTAL)) %>%
    ungroup()

We have succeded to reduce the dataset to an 488x6 dataset still containing the essentials of the storm data. With this, we conclude the data preprocessing and will finally do some plots.

Results

Impact on population Health

Let’s first analyse which event type has the strongest impact on population health:

    # prepare the dataset for plotting
    df <- df %>% gather(Damage_Type, Damage_Value, FATALITIES:PROPDMG,factor_key = TRUE)
    df <- df %>%
        arrange(BGN_YEAR,Event_Type,Damage_Type,Damage_Value)
    df <- df %>% mutate(Damage_Category=ifelse(Damage_Type %in% c("FATALITIES","INJURIES"),
                                               "population Health",
                                               "Economic Damage"))
    df <- df %>% replace_na(list(Event_Type = "Other")) %>%
        mutate(Damage_Category=as.factor(Damage_Category)) %>%
        mutate(Event_Type=ifelse(is.na(Event_Type),"Other",Event_Type)) %>%
        mutate(Event_Type=as.factor(Event_Type)) 

    # the dataset for this specific plot    
    df_Plot <- df %>%
        subset(Damage_Category=="population Health") %>%
        group_by (Event_Type,Damage_Type) %>%
        summarise("N"=sum(Damage_Value)) %>%
        ungroup() %>%
        mutate (N2=ifelse(Damage_Type=="FATALITIES",-N,N/10)) %>% 
        left_join(df %>%
                      filter (Damage_Type=="FATALITIES") %>% 
                      select(BGN_YEAR,Event_Type,Damage_Value) %>% 
                      mutate(FAT=Damage_Value) %>% 
                      select (-Damage_Value) %>%
                      group_by(Event_Type) %>%
                      summarize(FAT=sum(FAT)) %>%
                      ungroup()) %>%
        arrange(desc(FAT)) %>%
        head(n=20)
    
    brks <- seq(-7500, 12500, 2500)
    lbls <- (as.character(c(seq(7500,0,-2500),seq(25000,125000,25000))))
    
    # the plot
    df_Plot %>%
        ggplot(aes(x=reorder(Event_Type,+FAT),
                   y=N2,
                   fill=Damage_Type)) + 
        geom_bar(stat="identity",width=0.6) + 
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="FATALITIES") %>%
                      arrange(desc(N)),
                  aes(label = N), hjust = 1., size=3,color = "#000000") +
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="INJURIES") %>%
                      arrange(desc(N)),
                  aes(label = N), hjust = 0., size=3,color = "#000000") +
        scale_y_continuous(breaks = brks, 
                           labels=lbls, 
                           limits=c(-7500,12500)) +
        scale_fill_manual(values = c("red", "blue")) +
        labs(x="Event Type",
             y="Number of affected people",
             title="Top 10 population health impacting event types",
             subtitle="Ordered by decreasing number of fatalities") +
        guides(fill=guide_legend(title="New Legend Title")) + 
        theme(legend.title=element_blank()) +
        theme(plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust=0.5)) +
        coord_flip()

We can see that Tornados, Heat and Floods have the strongest impact on population Health, Both for Fatalities and Injuries. The most devastating effects on population health are concentrated on just a few event types, the top three cover already more than 2/3 of the total fatalities for all event types. A question that arises is: what is the percentage of total Casualties covered by these Eventtypes? What is the percentage of the total damage in other damage categories (CROP, PROP, ECONOMY=CROP+PROP), INJURIES) covered by these event types?. Let’s quantify that:

# total damage
df_total_damage <- df %>% group_by(Damage_Type) %>% summarise(Total=sum(Damage_Value))

myfunc_createOrderedDFOfEventTypes <- function(DT,df){
    if (DT == "ECONOMY"){
        DT2 <- c("CROPDMG","PROPDMG")
    }
    else{
        DT2 <- c(DT)
    }
    
    df %>% 
        filter(Damage_Type %in% DT2) %>% 
        group_by (Event_Type) %>% 
        summarize(Damage_Value=sum(Damage_Value)) %>% 
        arrange(desc(Damage_Value)) %>% 
        ungroup() 
}


# Event Types ordered wrt all Damage Types, all years
DTs <- toupper(c("Fatalities","Injuries","PropDmg","PropDmg","Economy"))
l_OrderedEventTypes <- lapply(DTs,function(DT){myfunc_createOrderedDFOfEventTypes(DT,df)})
names(l_OrderedEventTypes) <- DTs

myfunc_createSummaryDF <- function(df_total_damage,df,l_OrderedEventTypes,n,DT){
    df_total_damage %>% left_join(
        df %>% filter (Event_Type %in% (
            l_OrderedEventTypes[[DT]] %>% head(n))$Event_Type) %>% 
            group_by (Damage_Type) %>% 
            summarize(Damage_Value=sum(Damage_Value)) 
    ) %>% 
        mutate(!!sprintf("Percentage damage",n,"CROP"):=Damage_Value/Total*100)
}

# impact of top n eventtypes
n <- 3
DT <- "FATALITIES"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
      caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
    kable_styling(full_width = F)

Impact of the top 3 event types wrt FATALITIES on total values
Damage_Type	Total	Damage_Value	Percentage damage
FATALITIES	15081	10325	68.46363
INJURIES	139710	108512	77.66946
CROPDMG	65	18	28.30610
PROPDMG	583	340	58.29518

OK, so the top three event types for Fatalities cover more than 2/3 of all Fatalities and all Injuries, more than half of the Property damage, and still more than 25% of the Crop Damage. We can confidently say that Tornados, Heat and Flood have the most devastating effects on population Health while they have a very strong impact on the economy.

Impact on the economy

Now, let’s do the same analysis for economic consequences. In this case, we will consider the strongest impacts on the economy, which we consider to be PROPDMG+CROPDMG. Let’s do the plot:

    df$Damage_Type <- factor(df$Damage_Type, levels(df$Damage_Type)[c(1,2,4,3)])
    

    # the dataset for this specific plot    
    df_Plot <- df %>%
        subset(Damage_Category=="Economic Damage") %>%
        group_by (Event_Type,Damage_Type) %>%
        summarise("N"=sum(Damage_Value)) %>%
        ungroup() %>%
        mutate (N2=ifelse(Damage_Type=="PROPDMG",-N,N*10)) %>% 
        left_join(df %>%
                      filter (Damage_Type %in% c("PROPDMG","CROPDMG")) %>% 
                      select(BGN_YEAR,Event_Type,Damage_Value) %>% 
                      mutate(ECONOMICDMG=Damage_Value) %>% 
                      select (-Damage_Value) %>%
                      group_by(Event_Type) %>%
                      summarize(ECONOMICDMG=sum(ECONOMICDMG)) %>%
                      ungroup()) %>%
        arrange(desc(ECONOMICDMG)) %>%
        head(n=20)
    
    brks <- seq(-250, 200, 50)
    lbls <- (as.character(c(seq(250,0,-50),seq(50,200,50))))
    
    # the plot
    df_Plot %>%
        ggplot(aes(x=reorder(Event_Type,+N),
                   y=N2,
                   fill=Damage_Type)) + 
        geom_bar(stat="identity",width=0.6) + 
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="PROPDMG") %>%
                      arrange(desc(N)),
                  aes(label = round(N,1)), hjust = 1., size=3,color = "#000000") +
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="CROPDMG") %>%
                      arrange(desc(N)),
                  aes(label = round(N,1)), hjust = 0., size=3,color = "#000000") +
        scale_y_continuous(breaks = brks, 
                           labels=lbls, 
                           limits=c(-250,200)) +
        scale_fill_manual(values = c("red", "blue"),
                          labels = c("Property Damage","Crop Damage")) +
        labs(x="Event Type",
             y="Damage in Billion Dollars, inflation adjusted",
             title="Top 10 event types impacting the economy",
             subtitle="Ordered by decreasing value of economic consequences") +
        guides(fill=guide_legend(title="New Legend Title")) + 
        theme(legend.title=element_blank()) +
        theme(plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust=0.5)) +
        coord_flip()

It appears that the most devastating event types wrt economic loss are Floods, Tornados, and Hurricanes. The first two (Tornados and Floods) are also in the top 3 of the most devasttating event types for population health. The third one, Hurricanes, has a strong impact on the economy, but a low impact on population health (outside the top 10 event types wrt population health). Droughts are a special case as they have the strongest impact of all event types on Crop damage, but a much lower impact on Property and population health (outside the top 10).

Let’s again have a look on the impact of the top 3 event types for economic consequences wrt the other damage categories:

n <- 3
DT <- "ECONOMY"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
      caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
    kable_styling(full_width = F)

Impact of the top 3 event types wrt ECONOMY on total values
Damage_Type	Total	Damage_Value	Percentage damage
FATALITIES	15081	7289	48.33234
INJURIES	139710	100617	72.01847
CROPDMG	65	24	37.22013
PROPDMG	583	442	75.76431

It is apparent that in this case, the top 3 event types do not explain as much of the total damage as they did in the case of the consequences on population health. The economic consequences are more spread out among the event types when compared to the consequences on population health. Let’s consider the top 10 event types instead of the top 3:

n <- 10
DT <- "ECONOMY"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
      caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
    kable_styling(full_width = F)

Impact of the top 10 event types wrt ECONOMY on total values
Damage_Type	Total	Damage_Value	Percentage damage
FATALITIES	15081	9137	60.58617
INJURIES	139710	119637	85.63238
CROPDMG	65	59	90.32327
PROPDMG	583	579	99.19829

So the top 10 event types cover far more of the total value wrt damage categories not including Fatalities (more than 85%), but wrt the impact on Fatalities they still only explain 60%.

So we can conclude the following:

Two event types, Floods and Tornados, are within the top three event types wrt both economic and population health consequences
Hurricanes have a strong impact on the economy (rank 3), but a rather low impact on pubic health (not inside the top 10 events)
Heat has a rather high impact on population health (rank 3), but a rather low impact on the economy (outside the top 10 events)
Droughts are a special case inasmuch as they have a strong impact on Crop damage, but low impact on both population health as well as property