Synopsis

Extreme weather events like storms and hurricanes can have severe consequences for both population health and the economy. A key concern of the administration is therefore to mitigate the negative consequences of such events. The U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database records the impact of such weather events on U.S. soil. Analysis of this database allows the administration to decide how and where to best spend available resources.

This report explores the storm database and answers two questions: which event types are most harmful to population health, and which event types have the greatest economic consequences?

It shows that tornadoes are most harmful to population health, while floods cause the largest economic losses.

Data Processing

First, we load the libraries we are going to use in the rest of the analysis:

require(readr)
require(dplyr)
require(tidyr)
require(knitr)
require(lubridate)
require(ggplot2)
require(kableExtra)

Download Storm Data File

We now download the NOAA storm database and read it into an R data frame:

URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2";
FN <- "StormData.csv.bz2";
download.file(URL,destfile = FN);
df <- read_csv(FN);
df_orig <- df;

Data Preprocessing

Pruning Data

Many of the rows are not relevant for our analysis as they report neither human nor economic consequences. We remove these rows.

  df <- df %>% 
    filter(FATALITIES > 0 | INJURIES > 0 | CROPDMG > 0 | PROPDMG > 0 )

Before this pruning, there were 902297 rows. After the pruning, 254633 rows remain, a reduction of 71.78%. Some of the rows appear to be duplicates: there are many rows with identical content except for the REFNUM column. We eliminate these duplicates.

  dupesWhenIgnoringRefnum <- df %>% 
    select(-REFNUM) %>% 
    duplicated();
  
  df <- df %>% 
    mutate(dupesWhenIgnoringRefnum=dupesWhenIgnoringRefnum);
  
  df_WithDuplicateInformation <- df;

  # after this operation, df does not contain duplicates due to REFNUM
  df <- df_WithDuplicateInformation %>%
    filter(dupesWhenIgnoringRefnum==FALSE) %>%
    select(-dupesWhenIgnoringRefnum)

We still have 253894 rows, a further reduction of 0.29%. There are probably far more duplicates. For example, some rows are identical except that a field is NA in one row while the duplicate candidate has a value in the same field. There are also duplicate candidates whose only difference is a different spelling in the REMARKS field. However, we cannot generalize this finding: sometimes there is more than just a spelling difference, sometimes the BGN_LOCATI differs, and so on. As we do not know exactly where these duplicates originate, we do not prune the dataset any further on this basis.
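
As a rough indication of how many such near-duplicates remain, one can repeat the duplicated() check while additionally ignoring the free-text REMARKS column (an exploratory count only; these rows are not removed):

  # exploratory check: rows that become duplicates once REMARKS is ignored as well
  # (we do NOT remove these, as we cannot be sure they are true duplicates)
  df %>%
    select(-REFNUM, -REMARKS) %>%
    duplicated() %>%
    sum()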

Some more pruning, this time for the columns:

    # select only columns that we need
    df <- df %>% select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,
                        EVTYPE,
                        FATALITIES,INJURIES,
                        CROPDMG,CROPDMGEXP,
                        PROPDMG,PROPDMGEXP,
                        REFNUM,REMARKS)

    # We will only work with year and month of the event
    df <- df %>% mutate(BGN_YEAR=year(mdy_hms(BGN_DATE)),
                        BGN_MONTH=month(mdy_hms(BGN_DATE)))

Preparation for the Calculation of Economic Costs

The data contain fields describing economic (CROPDMG, PROPDMG) and human (FATALITIES, INJURIES) consequences. The values of CROPDMG and PROPDMG are associated with a multiplier (CROPDMGEXP, PROPDMGEXP). The accompanying documentation (section 2.7) only describes K (thousands), M (millions), and B (billions) as possible multipliers. However, inspecting the dataset reveals that far more multipliers have been used.

strHere <- "CROPDMGEXP";   
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "CROPDMGEXP :  NA M K m B ? 0 k"
strHere <- "PROPDMGEXP";   
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "PROPDMGEXP :  K M NA B m + 0 5 6 4 h 2 7 3 H -"

In the following, we will assume that lowercase letters should be uppercase:

strHere <- "CROPDMGEXP";   
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "CROPDMGEXP :  NA M K m B ? 0 k"
strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "PROPDMGEXP :  K M NA B m + 0 5 6 4 h 2 7 3 H -"
# cast lowercase letters to uppercase letters
df$CROPDMGEXP <- toupper(df$CROPDMGEXP);
df$PROPDMGEXP <- toupper(df$PROPDMGEXP);

strHere <- "CROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "CROPDMGEXP :  NA M K B ? 0"
strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "PROPDMGEXP :  K M NA B + 0 5 6 4 H 2 7 3 -"

There are still a number of multipliers that we do not understand. One way to resolve this is to look up the corresponding events interactively on the NOAA website and compare them with the rows carrying the unknown multipliers.

require(dplyr)
require(knitr)

# CROPDMGEXP is NA
ddf <- df[is.na(df$CROPDMGEXP) & df$CROPDMG > 0,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
7/4/1994 0:00:00 0400 ND STUTSMAN 3 NA 5 K
4/5/1994 0:00:00 1700 TX HAYS 4 NA 5 M
4/15/1994 0:00:00 1630 TX MEDINA 4 NA 500 K

We download the corresponding data and find:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
Jamestown STUTSMAN CO. ND 07/04/1994 04:00 CST Hail 1.75 in. 0 0 5.00K 0.00K
Jamestown STUTSMAN CO. ND 07/04/1994 04:55 CST Hail 1.25 in. 0 0 5.00K 5.00K

We conclude that there was no crop damage for this event. Possibly, all NA values indicate that the corresponding crop damage is zero. Downloading the data for the Texas case confirms this:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
San Marcos HAYS CO. TX 04/05/1994 17:00 CST Thunderstorm Wind 52 kts. 0 0 5.000M 0.00K

We therefore conclude that rows where is.na(CROPDMGEXP) & CROPDMG>0 is TRUE correspond to a multiplier of 0.

Now we are going to look at the ? cases:

# CROPDMGEXP is ?
selector <- which(df$CROPDMGEXP == "?" & df$CROPDMG >0);
length(selector);
## [1] 0

We see that there are no cases with CROPDMGEXP=="?" & CROPDMG >0 as the length of this selector is 0.

The last case for crops is the 0. Let’s have a look at that:

# CROPDMGEXP=="0"
selector <- which(df$CROPDMGEXP == "0" & df$CROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
7/17/1995 0:00:00 0258 GA CAMDEN 20 0 0 NA
10/6/1994 0:00:00 2105 IA BOONE 5 0 5 K
5/27/1995 0:00:00 1901 IA CRAWFORD 50 0 100 K
10/6/1994 0:00:00 1630 IA HANCOCK 25 0 5 K
10/6/1994 0:00:00 2225 IA MITCHELL 25 0 25 K
5/27/1995 0:00:00 2119 IA RINGGOLD 60 0 30 K
10/6/1994 0:00:00 2038 IA WEBSTER 5 0 2 K
10/6/1994 0:00:00 2039 IA WEBSTER 5 0 5 K
5/27/1995 0:00:00 2057 IA WEBSTER 50 0 40 K
10/6/1994 0:00:00 2123 IA WINNEBAGO 5 0 5 K
10/6/1994 0:00:00 2135 IA WINNEBAGO 5 0 2 K
10/6/1994 0:00:00 2101 IA WRIGHT 5 0 2 K

So there are 12 such rows. Let's have a look at two of them:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
Spring Bluff CAMDEN CO. GA 07/17/1995 02:58 EST Hail 0.75 in. 0 0 0.00K 0.20K
Mt Ayr to RINGGOLD CO. IA 05/27/1995 21:19 CST Tornado F1 0 0 30.00K 0.60K
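
As a quick sanity check (not part of the processing pipeline), the implied multiplier for these two rows is the dollar amount reported online divided by the CROPDMG value in the dataset:

# implied multipliers: online dollar amount / CROPDMG value in the dataset
c(CAMDEN = 200 / 20, RINGGOLD = 600 / 60)
# both equal 10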

These two cases suggest that CROPDMGEXP=="0" corresponds to a multiplier of 10. Due to time constraints we do not investigate further and use the following mapping between CROPDMGEXP values and multipliers:

EXP <- c("K","M","B","0","?","NA");
Multiplier <- c(1000,1e6,1e9,10,0,0);
Type <- rep("CROP",6);
mappingTableCrop <- data.frame(EXP=EXP, Multiplier=Multiplier, Type=Type);
kable(mappingTableCrop)
EXP Multiplier Type
K 1e+03 CROP
M 1e+06 CROP
B 1e+09 CROP
0 1e+01 CROP
? 0e+00 CROP
NA 0e+00 CROP

Let's do the same thing for the property damage, where there are quite a few more multiplier symbols. We start with NA.

# PROPDMGEXP is "NA"
selector <- which(is.na(df$PROPDMGEXP) & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(head(ddf))
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
3/9/1995 0:00:00 0301 CA MARIN 0 ? 0.41 NA
1/8/1993 0:00:00 1130 FL UNION 0 NA 3.00 NA
8/19/1995 0:00:00 1700 GA DECATUR 0 NA 2.00 NA
3/31/1993 0:00:00 2015 GA HART 0 NA 4.00 NA
5/12/1993 0:00:00 1630 IN CASS 0 NA 4.00 NA
5/14/1995 0:00:00 0300 IN VANDERBURGH 0 NA 10.00 NA
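
The total number of such rows can be read off the selector directly (a quick check):

# number of rows with PROPDMG > 0 but no PROPDMGEXP multiplier
length(selector)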

There are 76 such cases, but we only show the first six. A sample from the downloadable data yields the following:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
Gainesville UNION CO. FL 01/08/1993 11:30 EST Tornado F0 0 0 0.00K 0.00K
CASS CO. CASS CO. IN 05/12/1993 16:30 EST Thunderstorm Wind 0 kts. 0 0 0.00K 0.00K

We therefore conclude that NA values for PROPDMGEXP correspond to a multiplier of 0.

We continue with H (we suspect this corresponds to 100):

# PROPDMGEXP == "H"
selector <- which(df$PROPDMGEXP=="H" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
9/16/1994 0:00:00 1630 MI CLINTON 0 NA 2 H
7/14/1995 0:00:00 1923 NE SHERMAN 0 NA 5 H
1/14/1995 0:00:00 1050 SC LAURENS 0 NA 5 H
3/8/1995 0:00:00 1305 SC LAURENS 0 NA 2 H
1/19/1995 0:00:00 1815 SC PICKENS 0 NA 3 H
7/12/1995 0:00:00 1300 WY BIG HORN 0 NA 5 H
7/15/1995 0:00:00 1818 WY CAMPBELL 0 NA 5 H

There are 7 cases in total. A sample from the downloadable data yields the following:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
St. Johns CLINTON CO. MI 09/16/1994 16:30 EST Thunderstorm Wind 50 kts. 0 0 0.20K 0.00K
BIG HORN CO. BIG HORN CO. WY 07/12/1995 13:00 MST Thunderstorm Wind 54 kts. 0 0 0.50K 0.00K

The data confirm our suspicion for the two samples. We therefore conclude that H values for PROPDMGEXP correspond to a multiplier of 100.

We continue with -:

# PROPDMGEXP == "-"
selector <- which(df$PROPDMGEXP=="-" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
12/12/1995 0:00:00 1000 OR ORZ004 - 05 - 06 - 08 - 09 0 NA 15 -

There is only 1 case in total. The row seems corrupted, as there is no proper county name. A search in the downloadable data yields no hits for this date and Oregon. We therefore conclude that - corresponds to a multiplier of 0.

We continue with +:

# PROPDMGEXP == "+"
selector <- which(df$PROPDMGEXP=="+" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
5/1/1995 0:00:00 0000 AK AKZ001 0 NA 20 +
12/1/1994 0:00:00 0000 AK AKZ024 - 006 - 005 0 NA 20 +
3/9/1995 0:00:00 1000 CA CAZ001 0 NA 2 +
6/16/1995 0:00:00 0330 NV NVZ001 - 002 - 003 - 004 - 005 0 NA 15 +
6/5/1995 0:00:00 1304 NV NVZ003 - 004 - 007 0 NA 60 +

There are 5 cases in total. Again, the rows seem corrupted due to strange county names. We checked rows 1 and 3 in the online NOAA database and did not find any hits corresponding to the entries in the table above. We therefore conclude that + corresponds to a multiplier of 0.

We continue with 0 (we suspect this corresponds to 10 as in the case of crops):

# PROPDMGEXP == "0"
selector <- which(df$PROPDMGEXP=="0" & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(head(ddf,n=30))
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
12/3/1994 0:00:00 0000 AK AKZ024 - 006 - 005 0 NA 150.0 0
8/10/1994 0:00:00 2100 CO JEFFERSON 0 NA 50.0 0
8/10/1994 0:00:00 2130 CO JEFFERSON 0 NA 5.0 0
8/11/1994 0:00:00 0030 CO JEFFERSON 0 NA 37.0 0
8/12/1994 0:00:00 1300 CA RIVERSIDE 0 NA 0.5 0
8/13/1994 0:00:00 1300 CA RIVERSIDE 0 NA 0.5 0
8/10/1994 0:00:00 2100 CO BOULDER 0 NA 1.0 0
8/10/1994 0:00:00 2120 CO BOULDER 0 NA 1.0 0
8/1/1994 0:00:00 1415 CO CLEAR CREEK 0 NA 50.0 0
8/13/1994 0:00:00 1345 CO GILPIN 0 NA 1.5 0
5/2/1995 0:00:00 1820 CO JEFFERSON 0 NA 20.0 0
8/10/1994 0:00:00 2100 CO LARIMER 0 NA 50.0 0
8/19/1994 0:00:00 1700 CO MESA 0 NA 5.0 0
8/19/1994 0:00:00 2030 CO MESA 0 NA 50.0 0
8/12/1994 0:00:00 1630 CO MORGAN 0 NA 2.0 0
8/4/1994 0:00:00 1615 CO PARK 0 NA 4.0 0
8/1/1994 0:00:00 1700 CO PHILLIPS 0 NA 50.0 0
8/13/1994 0:00:00 1415 CO PUEBLO 0 NA 1.0 0
9/4/1994 0:00:00 0000 CO PUEBLO 0 NA 25.0 0
5/2/1995 0:00:00 1510 CO WELD 0 NA 1.0 0
4/4/1995 0:00:00 1322 CT FAIRFIELD 0 NA 25.0 0
6/20/1995 0:00:00 1520 CT HARTFORD 0 NA 20.0 0
6/20/1995 0:00:00 1535 CT HARTFORD 0 NA 10.0 0
6/20/1995 0:00:00 1630 CT MIDDLESEX 0 NA 0.7 0
4/4/1995 0:00:00 1320 CT NEW HAVEN 0 NA 25.0 0
5/11/1995 0:00:00 1145 FL COLUMBIA 0 NA 15.0 0
2/4/1995 0:00:00 0600 FL HILLSBOROUGH 0 NA 10.0 0
2/13/1995 0:00:00 2048 FL INDIAN RIVER 0 NA 50.0 0
2/3/1995 0:00:00 2210 FL MARION 0 NA 25.0 0
2/13/1995 0:00:00 2115 FL ST. LUCIE 0 NA 50.0 0

There are 209 cases in total, of which 30 are displayed. The first row seems to be corrupted. A sample from the downloadable data yields the following:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
Valle Vista RIVERSIDE CO. CA 08/12/1994 13:00 PST Tornado F0 0 0 0.00K 0.00K
Shelton FAIRFIELD CO. CT 04/04/1995 13:22 EST Thunderstorm Wind 0 kts. 0 0 0.25K

There is no clear-cut conclusion possible: the sample from Connecticut indicates a multiplier of 10, while the sample from California indicates a multiplier of 0. We tried to find several other rows of the dataset in the NOAA online database, with no hits. We finally conclude, mostly for consistency with the crop data, that 0 values for PROPDMGEXP correspond to a multiplier of 10.

We continue with the numeric multipliers larger than 0:

# PROPDMGEXP == 0..10
selector <- which(df$PROPDMGEXP %in% as.character(c(1:10)) & df$PROPDMG >0);
ddf <- df[selector,] %>%
  select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
BGN_DATE BGN_TIME STATE COUNTYNAME CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
2/8/1995 0:00:00 1800 CO COZ003>005 - 009 - 010 - 012 - 013 - 015 - 017 - 018 - 023 - 033>036 - 038>040 - 060 - 061 - 065 0.0 NA 1.7 5
6/7/1995 0:00:00 1905 CO WELD 25.0 M 30.0 5
4/4/1995 0:00:00 1400 CT MIDDLESEX 0.0 NA 0.1 5
5/27/1995 0:00:00 1620 IL CLINTON 0.0 NA 24.0 6
6/20/1995 0:00:00 1830 IL CLINTON 0.0 NA 10.0 5
5/27/1995 0:00:00 1715 IL GREENE 0.0 NA 14.0 5
6/20/1995 0:00:00 1900 IL MACOUPIN 0.0 NA 12.0 5
6/8/1995 0:00:00 0612 IL MADISON 0.0 NA 26.0 6
5/18/1995 0:00:00 1255 IL MARION 0.0 NA 15.0 6
5/18/1995 0:00:00 1137 IL MONROE 0.0 NA 88.0 5
5/18/1995 0:00:00 1140 IL MONROE 0.0 NA 12.0 5
5/27/1995 0:00:00 1717 IL MONTGOMERY 0.0 NA 10.0 4
9/7/1994 0:00:00 0115 KS DECATUR 0.0 NA 3.0 5
7/28/1995 0:00:00 0145 MA BRISTOL 0.0 NA 0.2 5
6/8/1995 0:00:00 0459 MO CALLAWAY 0.0 NA 12.0 2
6/7/1995 0:00:00 1210 MO FRANKLIN 0.0 NA 14.0 7
5/16/1995 0:00:00 1750 MO SHELBY 2.5 K 20.0 3
5/16/1995 0:00:00 0510 MO ST. LOUIS 0.0 NA 12.0 5
5/27/1995 0:00:00 1600 NE NEZ006>010 0.0 NA 2.2 4
5/15/1995 0:00:00 1730 NC NEW HANOVER 0.0 NA 1.0 5
7/4/1995 0:00:00 1930 NC NORTHAMPTON 0.0 NA 68.0 7
6/21/1995 0:00:00 2030 PA BEAVER 0.0 NA 1.0 5
8/18/1994 0:00:00 1500 PA LYCOMING 0.0 NA 2.2 4
4/13/1995 0:00:00 0000 SD SDZ004>008 - 010 - 011 - 016>018 - 034>037 - 051 0.0 NA 5.2 5
5/17/1995 0:00:00 0055 TX PARMER 0.0 NA 0.2 5
3/11/1995 0:00:00 1118 UT WASHINGTON 0.0 NA 0.7 5
3/11/1995 0:00:00 1625 UT WASHINGTON 0.0 NA 0.1 4
7/27/1995 0:00:00 2020 VA HENRICO 0.0 NA 13.0 5
6/22/1995 0:00:00 1950 VA ROCKBRIDGE 430.0 K 6.4 5

There are 29 cases in total, all of which are displayed. Some rows, again, seem to be corrupted. A sample from the downloadable data yields the following:

Location County/Zone St. Date Time T.Z. Type Mag Dth Inj PrD CrD
Gilcrest WELD CO. CO 06/07/1995 19:05 MST Hail 0.50 in. 0 0 0.30K 25.000M
St.Rose CLINTON CO. IL 05/27/1995 16:20 CST Thunderstorm Wind 0 kts. 0 0 0.25K 0.00K
Alton MADISON CO. IL 06/08/1995 06:12 CST Thunderstorm Wind 0 kts. 0 0 0.27K 0.00K
Shamrock CALLAWAY CO. MO 06/08/1995 04:59 CST Thunderstorm Wind 0 kts. 0 0 0.12K 0.00K
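
As a quick check, the implied multipliers for these four sampled rows are the dollar amounts reported online divided by the PROPDMG values in the dataset:

# implied multipliers: online dollar amount / PROPDMG value in the dataset
c(WELD = 300 / 30, CLINTON_IL = 250 / 24, MADISON = 270 / 26, CALLAWAY = 120 / 12)
# roughly 10, 10.42, 10.38, and 10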

There is no clear-cut conclusion possible: the implied multipliers take on different values between 0 and roughly 10.42 (250/24 for the Clinton, IL sample). We tried to find several other rows of the dataset in the NOAA online database, with no hits. We finally conclude, mostly for consistency with the crop data, that any numeric value of PROPDMGEXP corresponds to a multiplier of 10.

This therefore yields a mapping table as follows:

EXP <- c("K","M","B","NA","+","-","?","0","1","2","3","4","5","6","7","8","9");
Multiplier <- c(1000,1e6,1e9,0,0,0,0,10,10,10,10,10,10,10,10,10,10);
Type <- rep("PROP",17);
mappingTableProp <- data.frame(EXP=EXP, Multiplier=Multiplier, Type=Type);
mappingTable <- rbind(mappingTableCrop,mappingTableProp);
kable(mappingTable)
EXP Multiplier Type
K 1e+03 CROP
M 1e+06 CROP
B 1e+09 CROP
0 1e+01 CROP
? 0e+00 CROP
NA 0e+00 CROP
K 1e+03 PROP
M 1e+06 PROP
B 1e+09 PROP
NA 0e+00 PROP
+ 0e+00 PROP
- 0e+00 PROP
? 0e+00 PROP
0 1e+01 PROP
1 1e+01 PROP
2 1e+01 PROP
3 1e+01 PROP
4 1e+01 PROP
5 1e+01 PROP
6 1e+01 PROP
7 1e+01 PROP
8 1e+01 PROP
9 1e+01 PROP

In the last step, we calculate the real economic costs using the values in the ...EXP field and the multipliers in the mappingTable.

To achieve this, we join the multiplier mapping tables onto the data via the CROPDMGEXP and PROPDMGEXP columns and multiply the damage values by the corresponding multipliers:

    df <- df %>% 
        left_join(select(mappingTableCrop,-Type),by=c("CROPDMGEXP"="EXP")) %>% 
        mutate(CROP_Multiplier=Multiplier) %>%
        select(-Multiplier) %>%
        left_join(select(mappingTableProp,-Type),by=c("PROPDMGEXP"="EXP")) %>% 
        mutate(PROP_Multiplier=Multiplier) %>%
        select(-Multiplier) %>%
        mutate(CROPDMG_TOTAL=CROPDMG*CROP_Multiplier) %>%
        mutate(PROPDMG_TOTAL=PROPDMG*PROP_Multiplier) 
    
    # sometimes, the total values are NA because not all rows contain valid PROP and CROP damages
    # some contain only PROP and no CROP (-->NA) or only CROP and no PROP (-->NA) values
    # we set the NA values to 0 to allow for ease of use
    df$CROPDMG_TOTAL[is.na(df$CROPDMG_TOTAL)] <- 0;
    df$PROPDMG_TOTAL[is.na(df$PROPDMG_TOTAL)] <- 0;

Some more row pruning and summarising after this total damage calculation:

    df <- df %>% select(BGN_YEAR,BGN_MONTH,EVTYPE,
                         FATALITIES,INJURIES,
                         CROPDMG_TOTAL,PROPDMG_TOTAL)

    # now group by year, month, and EVTYPE
    df <- df %>% group_by (BGN_YEAR,
                           BGN_MONTH,
                           EVTYPE) %>%
        summarise_all(funs(sum)) %>%
        ungroup
    # finally, what is the size of the resulting dataset?
    dim(df)
## [1] 5870    7

So we now end up with 5870 rows and 7 columns. Finally, a small dataset.

OK, so what is the total damage to population health and the economy?

    # let's summarize the total damage to population health and economy:
    summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
                     INJURIES_SUM=sum(INJURIES),
                     CROPDMG_SUM=sum(CROPDMG_TOTAL),
                     PROPDMG_SUM=sum(PROPDMG_TOTAL))
    kable(summary_table)
FATALITIES_SUM INJURIES_SUM CROPDMG_SUM PROPDMG_SUM
15081 139710 49028865010 425730811103

Apparently, we are talking billions here. So let's scale those numbers down to something more human-readable.

    df <- df %>% mutate(CROPDMG_TOTAL=CROPDMG_TOTAL/1000000000) %>%
    mutate(PROPDMG_TOTAL=PROPDMG_TOTAL/1000000000) 

That should result in something more readable:

    # let's summarize the total damage to population health and economy:
    summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
                     INJURIES_SUM=sum(INJURIES),
                     CROPDMG_SUM=sum(CROPDMG_TOTAL),
                     PROPDMG_SUM=sum(PROPDMG_TOTAL))
    kable(summary_table)
FATALITIES_SUM INJURIES_SUM CROPDMG_SUM PROPDMG_SUM
15081 139710 49.02887 425.7308

The economic costs are not adjusted for inflation; at least, the storm data PDF documentation makes no mention of inflation. We therefore adjust for inflation using the US Consumer Price Index and Inflation dataset: each damage value is divided by the CPI index of its event month and then rescaled to the index level of the most recent month in the dataset.
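
As a minimal illustration of this adjustment (with made-up index values, not taken from the CPI file):

    # illustration only: hypothetical CPI index values
    nominal_damage <- 2.0    # damage in billions of dollars at the time of the event
    index_event    <- 150    # CPI index of the event month (hypothetical)
    index_latest   <- 225    # CPI index of the last month in the dataset (hypothetical)
    nominal_damage / index_event * index_latest    # inflation-adjusted damage: 3.0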

    # inflation-adjustment
    download.file("https://datahub.io/core/cpi-us/r/cpiai.csv",destfile = "cpi.csv")
    cpi <- read_csv("cpi.csv")
    cpi <- cpi %>% mutate(year=year(ymd(Date)),month=month(ymd(Date)))
    df <- df %>% left_join(cpi,by=c("BGN_YEAR"="year","BGN_MONTH"="month"))
    df<- df %>% 
        mutate(PROPDMG_INFLADJUSTED_TOTAL=PROPDMG_TOTAL/Index) %>% 
        mutate(CROPDMG_INFLADJUSTED_TOTAL=CROPDMG_TOTAL/Index)
    # scale to dollars as of the end of the dataset
    df <- df %>% 
        mutate(PROPDMG_INFLADJUSTED_TOTAL = PROPDMG_INFLADJUSTED_TOTAL * df$Index[nrow(df)]) %>%
        mutate(CROPDMG_INFLADJUSTED_TOTAL = CROPDMG_INFLADJUSTED_TOTAL * df$Index[nrow(df)])
        # replace the TOTALs by the inflation adjusted values
    df <- df %>%
        mutate(PROPDMG_TOTAL=PROPDMG_INFLADJUSTED_TOTAL) %>%
        mutate(CROPDMG_TOTAL=CROPDMG_INFLADJUSTED_TOTAL) %>%
        select(-PROPDMG_INFLADJUSTED_TOTAL,-CROPDMG_INFLADJUSTED_TOTAL, -Index, -Inflation)
    
    
    summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
                                      INJURIES_SUM=sum(INJURIES),
                                      CROPDMG_SUM=sum(CROPDMG_TOTAL),
                                      PROPDMG_SUM=sum(PROPDMG_TOTAL))
    
    kable(summary_table)
FATALITIES_SUM INJURIES_SUM CROPDMG_SUM PROPDMG_SUM
15081 139710 64.85206 583.4175

Finally, we are going to drop the month information as we are not going to use it in the following:

    # now group by year and EVTYPE
    df <- df %>% group_by (BGN_YEAR,
                           EVTYPE) %>%
        summarise_all(funs(sum)) %>%
        ungroup %>%
        select(-BGN_MONTH)

We finish this section with a dataset of dimensions 1374x7 and 485 different event types.
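
Both figures can be verified directly (a quick check):

    # dimensions of the reduced dataset and number of distinct event types
    dim(df)
    length(unique(df$EVTYPE))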

Event Categorization

The National Weather Service documentation (section 2.1.1) lists 48 event types. We saved this list to the file Eventtypes.csv. The main dataframe df, however, contains 485 different event types. Our aim is now to map the event types occurring in the dataset to those defined in the documentation. First, we convert all strings to uppercase and repair some obvious misspellings, abbreviations, and semantic variants.

    d <- df$EVTYPE;
    d <- toupper(d);
    
    d <- gsub("TSTM|T.?U.?DERSTORM.?|THUNDERESTORM.?|THUNDEERSTORM.?|THUNERSTORM.?","THUNDERSTORM",d,ignore.case=TRUE);
    d <- gsub("LIGHTING|LIGNTNING","LIGHTNING",d,ignore.case=T);
    d <- gsub("TORNDAO","TORNADO",d,ignore.case=TRUE);
    d <- gsub("WINDS|WINS","WIND",d,ignore.case=TRUE);
    d <- gsub("SHOWER","RAIN",d,ignore.case=TRUE);
    d <- gsub("WINTRY","WINTER",d,ignore.case=TRUE);
    d <- gsub("Low temperature","COLD",d,ignore.case=TRUE);
    d <- gsub("FLD|FLOODING|FLOODS","FLOOD",d,ignore.case=TRUE);
    d <- gsub("AVALANCE","AVALANCHE",d,ignore.case=TRUE);
    d <- gsub("  "," ",d,ignore.case=TRUE);
    
    df$EVTYPE <- d;

Good, but we still have 404 event types. One way to map these to the 48 given by the documentation is to select, for each event type in the data, the closest event type from the documentation. We can do that using R's adist function, which calculates the Levenshtein distance between strings. We then choose the documentation event type with the lowest distance to the data event type; in case of ties, we simply take the first occurrence. At the end, we create a mapping table between data event types and documentation event types.
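
To illustrate how adist behaves (not part of the pipeline), here are two distances for hand-picked strings:

    # Levenshtein distances between example strings
    adist("TORNDAO", "Tornado", ignore.case = TRUE)    # 2: two substitutions fix the typo
    adist("HEAT WAVE", "Heat", ignore.case = TRUE)     # 5: delete the trailing " WAVE"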

    strEVTYPES <- read_csv("Eventtypes.csv",col_names=FALSE)$X1
    
    d <- unique(d) %>% sort();
    a <- adist(d,strEVTYPES,ignore.case = T) %>%
        apply(1,which.min) %>%
        sapply(function(x){strEVTYPES[x]})
    
    c <- data.frame(EventTypesInData=d,EventTypesInDoc=a)
    kable(c  %>% arrange(EventTypesInData) %>% head(30)) 
EventTypesInData EventTypesInDoc
? Hail
AGRICULTURAL FREEZE Frost/Freeze
APACHE COUNTY Waterspout
ASTRONOMICAL HIGH TIDE Astronomical Low Tide
ASTRONOMICAL LOW TIDE Astronomical Low Tide
AVALANCHE Avalanche
BEACH EROSION Heavy Rain
BLACK ICE Avalanche
BLIZZARD Blizzard
BLIZZARD/WINTER STORM Winter Storm
BLOWING DUST Blizzard
BLOWING SNOW Heavy Snow
BREAKUP FLOOD Flash Flood
BRUSH FIRE Wildfire
COASTAL EROSION Coastal Flood
COASTAL FLOOD Coastal Flood
COASTAL FLOOD/EROSION Coastal Flood
COASTAL STORM Coastal Flood
COASTAL SURGE Coastal Flood
COASTALSTORM Coastal Flood
COLD Flood
COLD AIR TORNADO Tornado
COLD AND SNOW Heavy Snow
COLD AND WET CONDITIONS Cold/Wind Chill
COLD TEMPERATURE Cold/Wind Chill
COLD WAVE Wildfire
COLD WEATHER Winter Weather
COLD/WIND High Wind
COLD/WIND CHILL Cold/Wind Chill
COOL AND WET Avalanche

The result is mixed: some values are matched reasonably well, others are not right at all. We therefore resort to a manual mapping of event types.

    mappingTable <- data.frame(EventTypeData=d,
                               EventTypeNew=rep(NA,length(d)));
    selector <- grepl("TORNADO",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "TORNADO"
    selector <- grepl("HURRICANE|TYPHOON",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HURRICANE"
    selector <- grepl("THUNDERSTORM",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "THUNDERSTORM"
    selector <- grepl("FLOOD|FLD|FLOODING|FLOODS",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FLOOD"
    selector <- grepl("DROUGHT",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "DROUGHT"
    selector <- grepl("HEAT|.*hyperthermia.*|unseasonably warm.*|.*warm weather.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HEAT"
    selector <- grepl("COOL|COLD|.*hypothermia.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "COLD"
    selector <- grepl("STORM",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "STORM"
    selector <- grepl("WIND|NON.*THUNDERSTORM.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HIGH WIND"
    selector <- grepl("HAIL",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HAIL"
    selector <- grepl("RAIN|PRECIP.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "RAIN"
    selector <- grepl("ICE|ICY",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "SNOW/ICE"
    selector <- grepl("SNOW",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "SNOW/ICE"
    selector <- grepl("FREEZ.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FROST/FREEZE"
    selector <- grepl("FROST.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FROST/FREEZE"
    selector <- grepl("TSUNAMI",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "TSUNAMI"
    selector <- grepl("FIRE",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FIRE"
    selector <- grepl("WINTER",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "WINTER"
    selector <- grepl("LIGHTNING",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "LIGHTNING"
    selector <- grepl("FOG",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "FOG"
    selector <- grepl("BLIZZARD",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "BLIZZARD"
    selector <- grepl("AVALANCHE",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "AVALANCHE"
    selector <- grepl("rip current.*",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "RIP CURRENT"
    selector <- grepl("SURF",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "HIGH SURF"
    selector <- grepl("LANDSLIDE",d,ignore.case=TRUE)  & is.na(mappingTable$EventTypeNew);
    mappingTable$EventTypeNew[selector] <- "LANDSLIDE"
    
    kable(mappingTable  %>% arrange(EventTypeData) %>% head(30)) 
EventTypeData EventTypeNew
? NA
AGRICULTURAL FREEZE FROST/FREEZE
APACHE COUNTY NA
ASTRONOMICAL HIGH TIDE NA
ASTRONOMICAL LOW TIDE NA
AVALANCHE AVALANCHE
BEACH EROSION NA
BLACK ICE SNOW/ICE
BLIZZARD BLIZZARD
BLIZZARD/WINTER STORM STORM
BLOWING DUST NA
BLOWING SNOW SNOW/ICE
BREAKUP FLOOD FLOOD
BRUSH FIRE FIRE
COASTAL EROSION NA
COASTAL FLOOD FLOOD
COASTAL FLOOD/EROSION FLOOD
COASTAL STORM STORM
COASTAL SURGE NA
COASTALSTORM STORM
COLD COLD
COLD AIR TORNADO TORNADO
COLD AND SNOW COLD
COLD AND WET CONDITIONS COLD
COLD TEMPERATURE COLD
COLD WAVE COLD
COLD WEATHER COLD
COLD/WIND COLD
COLD/WIND CHILL COLD
COOL AND WET COLD

That looks better. Still, 13.12% of the data event types remain unmapped. Let's have a look at what share of the population health and economic damage they represent.
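
The share of unmapped event types quoted above can be recomputed directly from the mapping table (a quick check):

    # percentage of cleaned event types without a manual mapping
    mean(is.na(mappingTable$EventTypeNew)) * 100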

    # apply the mappingTable
    df2 <- df %>% left_join(mappingTable,
                            by=c("EVTYPE"="EventTypeData"))
    ddf <- df2 %>% mutate(chooser=is.na(EventTypeNew)) %>% group_by(chooser) %>%
        summarize(sumFAT=sum(FATALITIES),
                  sumINJ=sum(INJURIES),
                  sumDMGCROP=sum(CROPDMG_TOTAL),
                  sumDMGPROP=sum(PROPDMG_TOTAL));
    kable(ddf)
chooser sumFAT sumINJ sumDMGCROP sumDMGPROP
FALSE 15028 139362 64.6399213 583.3668594
TRUE 53 348 0.2121368 0.0506244

Now, that seems reasonable. We have 24 event types, one of which collects all the unmatched data event types, yet the 23 mapped categories cover more than 99% of the damage in every category. We can continue with the resulting dataset.
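
The coverage claim can be checked from the summary table ddf above (a quick check; chooser == TRUE marks the unmatched event types):

    # percentage of each damage category that falls into unmatched event types
    unmatched <- ddf[ddf$chooser, -1]
    round(100 * unlist(unmatched) / colSums(ddf[, -1]), 2)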

    df <- df2 %>% mutate(Event_Type=EventTypeNew) %>%
        select(-EVTYPE,-EventTypeNew) %>%
        group_by (BGN_YEAR, Event_Type) %>%
        summarize(FATALITIES=sum(FATALITIES),
        INJURIES=sum(INJURIES),
        CROPDMG=sum(CROPDMG_TOTAL),
        PROPDMG=sum(PROPDMG_TOTAL)) %>%
    ungroup()

We have succeeded in reducing the data to a 488x6 dataset that still contains the essentials of the storm data. With this, we conclude the data preprocessing and finally turn to some plots.

Results

Impact on Population Health

Let’s first analyse which event type has the strongest impact on population health:

    # prepare the dataset for plotting
    df <- df %>% gather(Damage_Type, Damage_Value, FATALITIES:PROPDMG,factor_key = TRUE)
    df <- df %>%
        arrange(BGN_YEAR,Event_Type,Damage_Type,Damage_Value)
    df <- df %>% mutate(Damage_Category=ifelse(Damage_Type %in% c("FATALITIES","INJURIES"),
                                               "population Health",
                                               "Economic Damage"))
    df <- df %>% replace_na(list(Event_Type = "Other")) %>%
        mutate(Damage_Category=as.factor(Damage_Category)) %>%
        mutate(Event_Type=ifelse(is.na(Event_Type),"Other",Event_Type)) %>%
        mutate(Event_Type=as.factor(Event_Type)) 

    # the dataset for this specific plot    
    df_Plot <- df %>%
        subset(Damage_Category=="population Health") %>%
        group_by (Event_Type,Damage_Type) %>%
        summarise("N"=sum(Damage_Value)) %>%
        ungroup() %>%
        mutate (N2=ifelse(Damage_Type=="FATALITIES",-N,N/10)) %>% 
        left_join(df %>%
                      filter (Damage_Type=="FATALITIES") %>% 
                      select(BGN_YEAR,Event_Type,Damage_Value) %>% 
                      mutate(FAT=Damage_Value) %>% 
                      select (-Damage_Value) %>%
                      group_by(Event_Type) %>%
                      summarize(FAT=sum(FAT)) %>%
                      ungroup()) %>%
        arrange(desc(FAT)) %>%
        head(n=20)
    
    brks <- seq(-7500, 12500, 2500)
    lbls <- (as.character(c(seq(7500,0,-2500),seq(25000,125000,25000))))
    
    # the plot
    df_Plot %>%
        ggplot(aes(x=reorder(Event_Type,+FAT),
                   y=N2,
                   fill=Damage_Type)) + 
        geom_bar(stat="identity",width=0.6) + 
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="FATALITIES") %>%
                      arrange(desc(N)),
                  aes(label = N), hjust = 1., size=3,color = "#000000") +
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="INJURIES") %>%
                      arrange(desc(N)),
                  aes(label = N), hjust = 0., size=3,color = "#000000") +
        scale_y_continuous(breaks = brks, 
                           labels=lbls, 
                           limits=c(-7500,12500)) +
        scale_fill_manual(values = c("red", "blue")) +
        labs(x="Event Type",
             y="Number of affected people",
             title="Top 10 population health impacting event types",
             subtitle="Ordered by decreasing number of fatalities") +
        theme(legend.title=element_blank()) +
        theme(plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust=0.5)) +
        coord_flip()

We can see that tornadoes, heat, and floods have the strongest impact on population health, both for fatalities and injuries. The most devastating effects on population health are concentrated in just a few event types; the top three already cover more than two thirds of the total fatalities across all event types. Two questions arise: what percentage of the total casualties is covered by these event types, and what percentage of the total damage in the other damage categories (INJURIES, CROP, PROP, ECONOMY = CROP + PROP) do they cover? Let's quantify that:

# total damage
df_total_damage <- df %>% group_by(Damage_Type) %>% summarise(Total=sum(Damage_Value))

myfunc_createOrderedDFOfEventTypes <- function(DT,df){
    if (DT == "ECONOMY"){
        DT2 <- c("CROPDMG","PROPDMG")
    }
    else{
        DT2 <- c(DT)
    }
    
    df %>% 
        filter(Damage_Type %in% DT2) %>% 
        group_by (Event_Type) %>% 
        summarize(Damage_Value=sum(Damage_Value)) %>% 
        arrange(desc(Damage_Value)) %>% 
        ungroup() 
}


# Event Types ordered wrt all Damage Types, all years
DTs <- toupper(c("Fatalities","Injuries","CropDmg","PropDmg","Economy"))
l_OrderedEventTypes <- lapply(DTs,function(DT){myfunc_createOrderedDFOfEventTypes(DT,df)})
names(l_OrderedEventTypes) <- DTs

myfunc_createSummaryDF <- function(df_total_damage,df,l_OrderedEventTypes,n,DT){
    df_total_damage %>% left_join(
        df %>% filter (Event_Type %in% (
            l_OrderedEventTypes[[DT]] %>% head(n))$Event_Type) %>% 
            group_by (Damage_Type) %>% 
            summarize(Damage_Value=sum(Damage_Value)) 
    ) %>% 
        mutate("Percentage damage" = Damage_Value/Total*100)
}

# impact of top n eventtypes
n <- 3
DT <- "FATALITIES"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
      caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
    kable_styling(full_width = F)
Impact of the top 3 event types wrt FATALITIES on total values
Damage_Type Total Damage_Value Percentage damage
FATALITIES 15081 10325 68.46363
INJURIES 139710 108512 77.66946
CROPDMG 65 18 28.30610
PROPDMG 583 340 58.29518

OK, so the top three event types with respect to fatalities cover more than two thirds of all fatalities and injuries, more than half of the property damage, and still more than 25% of the crop damage. We can confidently say that tornadoes, heat, and floods have the most devastating effects on population health while also having a very strong impact on the economy.

Impact on the Economy

Now let's do the same analysis for the economic consequences. In this case, we consider the strongest impacts on the economy, which we define as PROPDMG + CROPDMG. Let's do the plot:

    df$Damage_Type <- factor(df$Damage_Type, levels(df$Damage_Type)[c(1,2,4,3)])
    

    # the dataset for this specific plot    
    df_Plot <- df %>%
        subset(Damage_Category=="Economic Damage") %>%
        group_by (Event_Type,Damage_Type) %>%
        summarise("N"=sum(Damage_Value)) %>%
        ungroup() %>%
        mutate (N2=ifelse(Damage_Type=="PROPDMG",-N,N*10)) %>% 
        left_join(df %>%
                      filter (Damage_Type %in% c("PROPDMG","CROPDMG")) %>% 
                      select(BGN_YEAR,Event_Type,Damage_Value) %>% 
                      mutate(ECONOMICDMG=Damage_Value) %>% 
                      select (-Damage_Value) %>%
                      group_by(Event_Type) %>%
                      summarize(ECONOMICDMG=sum(ECONOMICDMG)) %>%
                      ungroup()) %>%
        arrange(desc(ECONOMICDMG)) %>%
        head(n=20)
    
    brks <- seq(-250, 200, 50)
    lbls <- (as.character(c(seq(250,0,-50),seq(50,200,50))))
    
    # the plot
    df_Plot %>%
        ggplot(aes(x=reorder(Event_Type,+N),
                   y=N2,
                   fill=Damage_Type)) + 
        geom_bar(stat="identity",width=0.6) + 
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="PROPDMG") %>%
                      arrange(desc(N)),
                  aes(label = round(N,1)), hjust = 1., size=3,color = "#000000") +
        geom_text(data=df_Plot %>%
                      filter(Damage_Type=="CROPDMG") %>%
                      arrange(desc(N)),
                  aes(label = round(N,1)), hjust = 0., size=3,color = "#000000") +
        scale_y_continuous(breaks = brks, 
                           labels=lbls, 
                           limits=c(-250,200)) +
        scale_fill_manual(values = c("red", "blue"),
                          labels = c("Property Damage","Crop Damage")) +
        labs(x="Event Type",
             y="Damage in Billion Dollars, inflation adjusted",
             title="Top 10 event types impacting the economy",
             subtitle="Ordered by decreasing value of economic consequences") +
        theme(legend.title=element_blank()) +
        theme(plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust=0.5)) +
        coord_flip()

It appears that the most devastating event types with respect to economic loss are floods, tornadoes, and hurricanes. The first two (floods and tornadoes) are also in the top 3 of the most devastating event types for population health. The third, hurricanes, has a strong impact on the economy but a low impact on population health (outside the top 10 event types for population health). Droughts are a special case: they have the strongest impact of all event types on crop damage, but a much lower impact on property and population health (outside the top 10).

Let's again have a look at the impact of the top 3 event types for economic consequences on the other damage categories:

n <- 3
DT <- "ECONOMY"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
      caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
    kable_styling(full_width = F)
Impact of the top 3 event types wrt ECONOMY on total values
Damage_Type Total Damage_Value Percentage damage
FATALITIES 15081 7289 48.33234
INJURIES 139710 100617 72.01847
CROPDMG 65 24 37.22013
PROPDMG 583 442 75.76431

It is apparent that in this case the top 3 event types do not explain as much of the total damage as they did for population health; the economic consequences are spread more evenly across event types. Let's consider the top 10 event types instead of the top 3:

n <- 10
DT <- "ECONOMY"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
      caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
    kable_styling(full_width = F)
Impact of the top 10 event types wrt ECONOMY on total values
Damage_Type Total Damage_Value Percentage damage
FATALITIES 15081 9137 60.58617
INJURIES 139710 119637 85.63238
CROPDMG 65 59 90.32327
PROPDMG 583 579 99.19829

So the top 10 event types cover far more of the totals in the damage categories other than fatalities (more than 85%), but with respect to fatalities they still only explain about 60%.

So we can conclude the following:

  • Two event types, floods and tornadoes, are within the top three event types with respect to both economic and population health consequences
  • Hurricanes have a strong impact on the economy (rank 3), but a rather low impact on population health (not inside the top 10 events)
  • Heat has a rather high impact on population health (rank 3), but a rather low impact on the economy (outside the top 10 events)
  • Droughts are a special case inasmuch as they have a strong impact on crop damage, but a low impact on both population health and property