Extreme weather events like storms and hurricanes can have severe consequences on both population health and the economy. One key concern of the administration is therefore to prevent negative consequences of such events. The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database collects the impact of such weather events on U.S. soil. Analysis of this database allows the administration to decide how and where to best spend available resources.
This report explores this storm database and answers the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This report shows that tornados are most harmful for population health, while floods result in the largest economic loss.
First, load some libraries we are going to use in the rest of the analysis
require(readr)
require(dplyr)
require(tidyr)
require(knitr)
require(lubridate)
require(ggplot2)
require(kableExtra)
We are now downloading the NOAA storm database and read it into an R dataframe:
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2";
FN <- "StormData.csv.bz2";
download.file(URL,destfile = FN);
df <- read_csv(FN);
df_orig <- df;
Many of the rows are not interesting for us as they have neither human nor economic consequences. We prune the corresponding rows.
df <- df %>%
filter(FATALITIES > 0 | INJURIES > 0 | CROPDMG > 0 | PROPDMG > 0 )
Before this pruning, there were 902297 rows. After the pruning, there are 254633, a reduction by 71.78%. Some of the rows appear to be duplicates. There are many rows with identical content except for the REFNUM column. We will eliminate these duplicates.
dupesWhenIgnoringRefnum <- df %>%
select(-REFNUM) %>%
duplicated();
df <- df %>%
mutate(dupesWhenIgnoringRefnum=dupesWhenIgnoringRefnum);
df_WithDuplicateInformation <- df;
# after this operation, df does not contain duplicates due to REFNUM
df <- df_WithDuplicateInformation %>%
filter(dupesWhenIgnoringRefnum==FALSE) %>%
select(-dupesWhenIgnoringRefnum)
We still have 253894 rows, but we achieved a further reduction by 0.29%. There are probably far more duplicates. For example, some rows are identical except for some field being NA in one row, while the duplicate candidate has some value in the same field. There are also some duplicate candidates where the only difference is some different spelling in the REMARKS field. However, we can not generalize this finding - sometimes, there is more than just a spelling difference, sometimes, there is a different BGN_LOCATI and so on. As we do not know exactly the origin of the duplicates, we will not continue to further prune the dataset due to duplicates.
Some more pruning, this time for the columns:
# select only columns that we need
df <- df %>% select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,
EVTYPE,
FATALITIES,INJURIES,
CROPDMG,CROPDMGEXP,
PROPDMG,PROPDMGEXP,
REFNUM,REMARKS)
# We will only work with year and month of the event
df <- df %>% mutate(BGN_YEAR=year(mdy_hms(BGN_DATE)),
BGN_MONTH=month(mdy_hms(BGN_DATE)))
The data contains fields describing economic (CROPDMG,PROPDMG) and human (FATALITIES,INJURIES) consequences. The values for CROPDMG and PROPDMG are associated with a multiplier (CROPDMGEXP, PROPDMGEXP). The associated information (section 2.7) only describes K (thousands) M (Millions) and B (Billions) as possible multipliers. However, insspecting the dataset reveals that far more mulitpliers have been used.
strHere <- "CROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "CROPDMGEXP : NA M K m B ? 0 k"
strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "PROPDMGEXP : K M NA B m + 0 5 6 4 h 2 7 3 H -"
In the following, we will assume that lowercase letters should be uppercase:
strHere <- "CROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "CROPDMGEXP : NA M K m B ? 0 k"
strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "PROPDMGEXP : K M NA B m + 0 5 6 4 h 2 7 3 H -"
# cast lowercase letters to uppercase letters
df$CROPDMGEXP <- toupper(df$CROPDMGEXP);
df$PROPDMGEXP <- toupper(df$PROPDMGEXP);
strHere <- "CROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "CROPDMGEXP : NA M K B ? 0"
strHere <- "PROPDMGEXP";
paste(c(strHere,": ",unique(df[[strHere]])), collapse=" ")
## [1] "PROPDMGEXP : K M NA B + 0 5 6 4 H 2 7 3 -"
There are still a number of multipliers that we do not understand. One way to fix this problem is to go to the NOAA website and download interactively pieces of data that correspond to our missing multipliers.
require(dplyr)
require(knitr)
# CROPDMGEXP is NA
ddf <- df[is.na(df$CROPDMGEXP) & df$CROPDMG > 0,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 7/4/1994 0:00:00 | 0400 | ND | STUTSMAN | 3 | NA | 5 | K |
| 4/5/1994 0:00:00 | 1700 | TX | HAYS | 4 | NA | 5 | M |
| 4/15/1994 0:00:00 | 1630 | TX | MEDINA | 4 | NA | 500 | K |
We download the corresponding data and find:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Jamestown | STUTSMAN CO. | ND | 07/04/1994 | 04:00 | CST | Hail | 1.75 in. | 0 | 0 | 5.00K | 0.00K |
| Jamestown | STUTSMAN CO. | ND | 07/04/1994 | 04:55 | CST | Hail | 1.25 in. | 0 | 0 | 5.00K | 5.00K |
We conclude that there was no crop damage for this event. Possibly, all NA values indicate that the corresponding Crop Damage should be zero? Downloading the data for the Texas case confirms that:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| San Marcos | HAYS CO. | TX | 04/05/1994 | 17:00 | CST | Thunderstorm Wind | 52 kts. | 0 | 0 | 5.000M | 0.00K |
We therefore conclude that cases where is.na(CROPDMGEXP) & CROPDMG>0 is TRUE is in fact a multiplier of 0.
Now we are going to look at the ? cases:
# CROPDMGEXP is ?
selector <- which(df$CROPDMGEXP == "?" & df$CROPDMG >0);
length(selector);
## [1] 0
We see that there are no cases with CROPDMGEXP=="?" & CROPDMG >0 as the length of this selector is 0.
The last case for crops is the 0. Let’s have a look at that:
# CROPDMGEXP=="0"
selector <- which(df$CROPDMGEXP == "0" & df$CROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 7/17/1995 0:00:00 | 0258 | GA | CAMDEN | 20 | 0 | 0 | NA |
| 10/6/1994 0:00:00 | 2105 | IA | BOONE | 5 | 0 | 5 | K |
| 5/27/1995 0:00:00 | 1901 | IA | CRAWFORD | 50 | 0 | 100 | K |
| 10/6/1994 0:00:00 | 1630 | IA | HANCOCK | 25 | 0 | 5 | K |
| 10/6/1994 0:00:00 | 2225 | IA | MITCHELL | 25 | 0 | 25 | K |
| 5/27/1995 0:00:00 | 2119 | IA | RINGGOLD | 60 | 0 | 30 | K |
| 10/6/1994 0:00:00 | 2038 | IA | WEBSTER | 5 | 0 | 2 | K |
| 10/6/1994 0:00:00 | 2039 | IA | WEBSTER | 5 | 0 | 5 | K |
| 5/27/1995 0:00:00 | 2057 | IA | WEBSTER | 50 | 0 | 40 | K |
| 10/6/1994 0:00:00 | 2123 | IA | WINNEBAGO | 5 | 0 | 5 | K |
| 10/6/1994 0:00:00 | 2135 | IA | WINNEBAGO | 5 | 0 | 2 | K |
| 10/6/1994 0:00:00 | 2101 | IA | WRIGHT | 5 | 0 | 2 | K |
So there are 12 cases for this case. Let’s have a look at two cases:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Spring Bluff | CAMDEN CO. | GA | 07/17/1995 | 02:58 | EST | Hail | 0.75 in. | 0 | 0 | 0.00K | 0.20K |
| Mt Ayr to | RINGGOLD CO. | IA | 05/27/1995 | 21:19 | CST | Tornado | F1 | 0 | 0 | 30.00K | 0.60K |
These tow cases suggest that CROPDMGEXP=="0" corresponds to a multiplier of 10. Due to temporal constraints we are not investigating any further and get the following table for a mapping between CROPDMGEXP and multipliers:
EXP <- c("K","M","B","0","?","NA");
Multiplier <- c(1000,1e6,1e9,10,0,0);
Type <- rep("CROP",6);
mappingTableCrop <- data.frame(EXP=EXP, Multiplier=Multiplier, Type=Type);
kable(mappingTableCrop)
| EXP | Multiplier | Type |
|---|---|---|
| K | 1e+03 | CROP |
| M | 1e+06 | CROP |
| B | 1e+09 | CROP |
| 0 | 1e+01 | CROP |
| ? | 0e+00 | CROP |
| NA | 0e+00 | CROP |
Let’s do the same thing for the Property damages. There are quite a bit more. We start with NA.
# PROPDMGEXP is "NA"
selector <- which(is.na(df$PROPDMGEXP) & df$PROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(head(ddf))
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 3/9/1995 0:00:00 | 0301 | CA | MARIN | 0 | ? | 0.41 | NA |
| 1/8/1993 0:00:00 | 1130 | FL | UNION | 0 | NA | 3.00 | NA |
| 8/19/1995 0:00:00 | 1700 | GA | DECATUR | 0 | NA | 2.00 | NA |
| 3/31/1993 0:00:00 | 2015 | GA | HART | 0 | NA | 4.00 | NA |
| 5/12/1993 0:00:00 | 1630 | IN | CASS | 0 | NA | 4.00 | NA |
| 5/14/1995 0:00:00 | 0300 | IN | VANDERBURGH | 0 | NA | 10.00 | NA |
There are 76 cases, but we only show the first 6 ones. A sample from the downloadable Data yields the following:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gainesville | UNION CO. | FL | 01/08/1993 | 11:30 | EST | Tornado | F0 | 0 | 0 | 0.00K | 0.00K |
| CASS CO. | CASS CO. | IN | 05/12/1993 | 16:30 | EST | Thunderstorm Wind | 0 kts. | 0 | 0 | 0.00K | 0.00K |
We therefore conclude that NA values for PROPDMGEXP corresponds to a multiplier of 0.
We continue with H (we suspect this corresponds to 100):
# PROPDMGEXP == "H"
selector <- which(df$PROPDMGEXP=="H" & df$PROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 9/16/1994 0:00:00 | 1630 | MI | CLINTON | 0 | NA | 2 | H |
| 7/14/1995 0:00:00 | 1923 | NE | SHERMAN | 0 | NA | 5 | H |
| 1/14/1995 0:00:00 | 1050 | SC | LAURENS | 0 | NA | 5 | H |
| 3/8/1995 0:00:00 | 1305 | SC | LAURENS | 0 | NA | 2 | H |
| 1/19/1995 0:00:00 | 1815 | SC | PICKENS | 0 | NA | 3 | H |
| 7/12/1995 0:00:00 | 1300 | WY | BIG HORN | 0 | NA | 5 | H |
| 7/15/1995 0:00:00 | 1818 | WY | CAMPBELL | 0 | NA | 5 | H |
There are 7 cases in total. A sample from the downloadable Data yields the following:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| St. Johns | CLINTON CO. | MI | 09/16/1994 | 16:30 | EST | Thunderstorm Wind | 50 kts. | 0 | 0 | 0.20K | 0.00K |
| BIG HORN CO. | BIG HORN CO. | WY | 07/12/1995 | 13:00 | MST | Thunderstorm Wind | 54 kts. | 0 | 0 | 0.50K | 0.00K |
The data confirm our suspicion for the two samples. We therefore conclude that H values for PROPDMGEXP corresponds to a multiplier of 100.
We continue with -:
# PROPDMGEXP == "-"
selector <- which(df$PROPDMGEXP=="-" & df$PROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 12/12/1995 0:00:00 | 1000 | OR | ORZ004 - 05 - 06 - 08 - 09 | 0 | NA | 15 | - |
There is only 1 case in total. The row seems to be corrupted as there is no proper county name. A search in the downloadable Data yields no hits for this date and Oregon. We therefore conclude that - corresponds to a multiplier of 0.
We continue with +:
# PROPDMGEXP == "+"
selector <- which(df$PROPDMGEXP=="+" & df$PROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 5/1/1995 0:00:00 | 0000 | AK | AKZ001 | 0 | NA | 20 | + |
| 12/1/1994 0:00:00 | 0000 | AK | AKZ024 - 006 - 005 | 0 | NA | 20 | + |
| 3/9/1995 0:00:00 | 1000 | CA | CAZ001 | 0 | NA | 2 | + |
| 6/16/1995 0:00:00 | 0330 | NV | NVZ001 - 002 - 003 - 004 - 005 | 0 | NA | 15 | + |
| 6/5/1995 0:00:00 | 1304 | NV | NVZ003 - 004 - 007 | 0 | NA | 60 | + |
There are 5 cases in total. Again, the rows seem corrupted due to strange county names. We checked row 1 and 3 on the online NOAA database and did not find any hits corresponding to the entries in the table above. We therefore conclude that - corresponds to a multiplier of 0.
We continue with 0 (we suspect this corresponds to 10 as in the case of crops):
# PROPDMGEXP == "0"
selector <- which(df$PROPDMGEXP=="0" & df$PROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(head(ddf,n=30))
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 12/3/1994 0:00:00 | 0000 | AK | AKZ024 - 006 - 005 | 0 | NA | 150.0 | 0 |
| 8/10/1994 0:00:00 | 2100 | CO | JEFFERSON | 0 | NA | 50.0 | 0 |
| 8/10/1994 0:00:00 | 2130 | CO | JEFFERSON | 0 | NA | 5.0 | 0 |
| 8/11/1994 0:00:00 | 0030 | CO | JEFFERSON | 0 | NA | 37.0 | 0 |
| 8/12/1994 0:00:00 | 1300 | CA | RIVERSIDE | 0 | NA | 0.5 | 0 |
| 8/13/1994 0:00:00 | 1300 | CA | RIVERSIDE | 0 | NA | 0.5 | 0 |
| 8/10/1994 0:00:00 | 2100 | CO | BOULDER | 0 | NA | 1.0 | 0 |
| 8/10/1994 0:00:00 | 2120 | CO | BOULDER | 0 | NA | 1.0 | 0 |
| 8/1/1994 0:00:00 | 1415 | CO | CLEAR CREEK | 0 | NA | 50.0 | 0 |
| 8/13/1994 0:00:00 | 1345 | CO | GILPIN | 0 | NA | 1.5 | 0 |
| 5/2/1995 0:00:00 | 1820 | CO | JEFFERSON | 0 | NA | 20.0 | 0 |
| 8/10/1994 0:00:00 | 2100 | CO | LARIMER | 0 | NA | 50.0 | 0 |
| 8/19/1994 0:00:00 | 1700 | CO | MESA | 0 | NA | 5.0 | 0 |
| 8/19/1994 0:00:00 | 2030 | CO | MESA | 0 | NA | 50.0 | 0 |
| 8/12/1994 0:00:00 | 1630 | CO | MORGAN | 0 | NA | 2.0 | 0 |
| 8/4/1994 0:00:00 | 1615 | CO | PARK | 0 | NA | 4.0 | 0 |
| 8/1/1994 0:00:00 | 1700 | CO | PHILLIPS | 0 | NA | 50.0 | 0 |
| 8/13/1994 0:00:00 | 1415 | CO | PUEBLO | 0 | NA | 1.0 | 0 |
| 9/4/1994 0:00:00 | 0000 | CO | PUEBLO | 0 | NA | 25.0 | 0 |
| 5/2/1995 0:00:00 | 1510 | CO | WELD | 0 | NA | 1.0 | 0 |
| 4/4/1995 0:00:00 | 1322 | CT | FAIRFIELD | 0 | NA | 25.0 | 0 |
| 6/20/1995 0:00:00 | 1520 | CT | HARTFORD | 0 | NA | 20.0 | 0 |
| 6/20/1995 0:00:00 | 1535 | CT | HARTFORD | 0 | NA | 10.0 | 0 |
| 6/20/1995 0:00:00 | 1630 | CT | MIDDLESEX | 0 | NA | 0.7 | 0 |
| 4/4/1995 0:00:00 | 1320 | CT | NEW HAVEN | 0 | NA | 25.0 | 0 |
| 5/11/1995 0:00:00 | 1145 | FL | COLUMBIA | 0 | NA | 15.0 | 0 |
| 2/4/1995 0:00:00 | 0600 | FL | HILLSBOROUGH | 0 | NA | 10.0 | 0 |
| 2/13/1995 0:00:00 | 2048 | FL | INDIAN RIVER | 0 | NA | 50.0 | 0 |
| 2/3/1995 0:00:00 | 2210 | FL | MARION | 0 | NA | 25.0 | 0 |
| 2/13/1995 0:00:00 | 2115 | FL | ST. LUCIE | 0 | NA | 50.0 | 0 |
There are 209 cases in total, displayed are 30. The first row seems to be corrupted. A sample from the downloadable Data yields the following:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Valle Vista | RIVERSIDE CO. | CA | 08/12/1994 | 13:00 | PST | Tornado | F0 | 0 | 0 | 0.00K | 0.00K |
| Shelton | FAIRFIELD CO. | CT | 04/04/1995 | 13:22 | EST | Thunderstorm | Wind | 0 kts. | 0 | 0 | 0.25K |
There is no clear-cut conclusion possible. The sample from Connecticut indicates a multiplier of 10, the sample from California a multiplier of 0. We tried to find several other rows from the dataset with no hits in the NOAA online dataset. We finally conclude, mostly for consistency reasons with the crops, that 0 values for PROPDMGEXP corresponds to a multiplier of 10.
We continue with cases larger than 0:
# PROPDMGEXP == 0..10
selector <- which(df$PROPDMGEXP %in% as.character(c(1:10)) & df$PROPDMG >0);
ddf <- df[selector,] %>%
select(BGN_DATE,BGN_TIME,STATE,COUNTYNAME,CROPDMG,CROPDMGEXP,PROPDMG,PROPDMGEXP);
kable(ddf)
| BGN_DATE | BGN_TIME | STATE | COUNTYNAME | CROPDMG | CROPDMGEXP | PROPDMG | PROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 2/8/1995 0:00:00 | 1800 | CO | COZ003>005 - 009 - 010 - 012 - 013 - 015 - 017 - 018 - 023 - 033>036 - 038>040 - 060 - 061 - 065 | 0.0 | NA | 1.7 | 5 |
| 6/7/1995 0:00:00 | 1905 | CO | WELD | 25.0 | M | 30.0 | 5 |
| 4/4/1995 0:00:00 | 1400 | CT | MIDDLESEX | 0.0 | NA | 0.1 | 5 |
| 5/27/1995 0:00:00 | 1620 | IL | CLINTON | 0.0 | NA | 24.0 | 6 |
| 6/20/1995 0:00:00 | 1830 | IL | CLINTON | 0.0 | NA | 10.0 | 5 |
| 5/27/1995 0:00:00 | 1715 | IL | GREENE | 0.0 | NA | 14.0 | 5 |
| 6/20/1995 0:00:00 | 1900 | IL | MACOUPIN | 0.0 | NA | 12.0 | 5 |
| 6/8/1995 0:00:00 | 0612 | IL | MADISON | 0.0 | NA | 26.0 | 6 |
| 5/18/1995 0:00:00 | 1255 | IL | MARION | 0.0 | NA | 15.0 | 6 |
| 5/18/1995 0:00:00 | 1137 | IL | MONROE | 0.0 | NA | 88.0 | 5 |
| 5/18/1995 0:00:00 | 1140 | IL | MONROE | 0.0 | NA | 12.0 | 5 |
| 5/27/1995 0:00:00 | 1717 | IL | MONTGOMERY | 0.0 | NA | 10.0 | 4 |
| 9/7/1994 0:00:00 | 0115 | KS | DECATUR | 0.0 | NA | 3.0 | 5 |
| 7/28/1995 0:00:00 | 0145 | MA | BRISTOL | 0.0 | NA | 0.2 | 5 |
| 6/8/1995 0:00:00 | 0459 | MO | CALLAWAY | 0.0 | NA | 12.0 | 2 |
| 6/7/1995 0:00:00 | 1210 | MO | FRANKLIN | 0.0 | NA | 14.0 | 7 |
| 5/16/1995 0:00:00 | 1750 | MO | SHELBY | 2.5 | K | 20.0 | 3 |
| 5/16/1995 0:00:00 | 0510 | MO | ST. LOUIS | 0.0 | NA | 12.0 | 5 |
| 5/27/1995 0:00:00 | 1600 | NE | NEZ006>010 | 0.0 | NA | 2.2 | 4 |
| 5/15/1995 0:00:00 | 1730 | NC | NEW HANOVER | 0.0 | NA | 1.0 | 5 |
| 7/4/1995 0:00:00 | 1930 | NC | NORTHAMPTON | 0.0 | NA | 68.0 | 7 |
| 6/21/1995 0:00:00 | 2030 | PA | BEAVER | 0.0 | NA | 1.0 | 5 |
| 8/18/1994 0:00:00 | 1500 | PA | LYCOMING | 0.0 | NA | 2.2 | 4 |
| 4/13/1995 0:00:00 | 0000 | SD | SDZ004>008 - 010 - 011 - 016>018 - 034>037 - 051 | 0.0 | NA | 5.2 | 5 |
| 5/17/1995 0:00:00 | 0055 | TX | PARMER | 0.0 | NA | 0.2 | 5 |
| 3/11/1995 0:00:00 | 1118 | UT | WASHINGTON | 0.0 | NA | 0.7 | 5 |
| 3/11/1995 0:00:00 | 1625 | UT | WASHINGTON | 0.0 | NA | 0.1 | 4 |
| 7/27/1995 0:00:00 | 2020 | VA | HENRICO | 0.0 | NA | 13.0 | 5 |
| 6/22/1995 0:00:00 | 1950 | VA | ROCKBRIDGE | 430.0 | K | 6.4 | 5 |
There are 29 cases in total, all are displayed. Some rows, again, seem to be corrupted. A sample from the downloadable Data yields the following:
| Location | County/Zone | St. | Date | Time | T.Z. | Type | Mag | Dth | Inj | PrD | CrD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gilcrest | WELD CO. | CO | 06/07/1995 | 19:05 | MST | Hail | 0.50 in. | 0 | 0 | 0.30K | 25.000M |
| St.Rose | CLINTON CO. | IL | 05/27/1995 | 16:20 | CST | Thunderstorm Wind | 0 kts. | 0 | 0 | 0.25K | 0.00K |
| Alton | MADISON CO. | IL | 06/08/1995 | 06:12 | CST | Thunderstorm Wind | 0 kts. | 0 | 0 | 0.27K | 0.00K |
| Shamrock | CALLAWAY CO. | MO | 06/08/1995 | 04:59 | CST | Thunderstorm Wind | 0 kts. | 0 | 0 | 0.12K | 0.00K |
There is no clear-cut conclusion possible. The multipliers can take on different values between 0 and 10.4166667. We tried to find several other rows from the dataset with no hits in the NOAA online dataset. We finally conclude, mostly for consistency reasons with the crops, that any numeric values for PROPDMGEXP corresponds to a multiplier of 10.
This therefore yields a mapping table as follows:
EXP <- c("K","M","B","NA","+","-","?","0","1","2","3","4","5","6","7","8","9");
Multiplier <- c(1000,1e6,1e9,0,0,0,0,10,10,10,10,10,10,10,10,10,10);
Type <- rep("PROP",17);
mappingTableProp <- data.frame(EXP=EXP, Multiplier=Multiplier, Type=Type);
mappingTable <- rbind(mappingTableCrop,mappingTableProp);
kable(mappingTable)
| EXP | Multiplier | Type |
|---|---|---|
| K | 1e+03 | CROP |
| M | 1e+06 | CROP |
| B | 1e+09 | CROP |
| 0 | 1e+01 | CROP |
| ? | 0e+00 | CROP |
| NA | 0e+00 | CROP |
| K | 1e+03 | PROP |
| M | 1e+06 | PROP |
| B | 1e+09 | PROP |
| NA | 0e+00 | PROP |
| + | 0e+00 | PROP |
| - | 0e+00 | PROP |
| ? | 0e+00 | PROP |
| 0 | 1e+01 | PROP |
| 1 | 1e+01 | PROP |
| 2 | 1e+01 | PROP |
| 3 | 1e+01 | PROP |
| 4 | 1e+01 | PROP |
| 5 | 1e+01 | PROP |
| 6 | 1e+01 | PROP |
| 7 | 1e+01 | PROP |
| 8 | 1e+01 | PROP |
| 9 | 1e+01 | PROP |
In the last step, we calculate the real economic costs using the values in the ...EXP field and the multipliers in the mappingTable.
To achieve this goal, we loop over the rows of mappingTable and calculate a total column for both crop and prop damage using a custom function calcTotalDamage:
df <- df %>%
left_join(select(mappingTableCrop,-Type),by=c("CROPDMGEXP"="EXP")) %>%
mutate(CROP_Multiplier=Multiplier) %>%
select(-Multiplier) %>%
left_join(select(mappingTableProp,-Type),by=c("PROPDMGEXP"="EXP")) %>%
mutate(PROP_Multiplier=Multiplier) %>%
select(-Multiplier) %>%
mutate(CROPDMG_TOTAL=CROPDMG*CROP_Multiplier) %>%
mutate(PROPDMG_TOTAL=PROPDMG*PROP_Multiplier)
# sometimes, the total values are NA because not all rows contain valid PROP and CROP damages
# some contain only PROP and no CROP (-->NA) or only CROP and no PROP (-->NA) values
# we set the NA values to 0 to allow for ease of use
df$CROPDMG_TOTAL[is.na(df$CROPDMG_TOTAL)] <- 0;
df$PROPDMG_TOTAL[is.na(df$PROPDMG_TOTAL)] <- 0;
Some more row pruning and summarising after this total damage calculation:
df <- df %>% select(BGN_YEAR,BGN_MONTH,EVTYPE,
FATALITIES,INJURIES,
CROPDMG_TOTAL,PROPDMG_TOTAL)
# now group by year, month, and EVTYPE
df <- df %>% group_by (BGN_YEAR,
BGN_MONTH,
EVTYPE) %>%
summarise_all(funs(sum)) %>%
ungroup
# finally, what is the size of the resulting dataset?
dim(df)
## [1] 5870 7
So we now end up with 5870 rows and 7 cols. Finally, a small dataset.
OK, so what is the togal damage to population health and the economy?
# let's summarize the total damage to population health and economy:
summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
INJURIES_SUM=sum(INJURIES),
CROPDMG_SUM=sum(CROPDMG_TOTAL),
PROPDMG_SUM=sum(PROPDMG_TOTAL))
kable(summary_table)
| FATALITIES_SUM | INJURIES_SUM | CROPDMG_SUM | PROPDMG_SUM |
|---|---|---|---|
| 15081 | 139710 | 49028865010 | 425730811103 |
Apparently, we are talking billions here. So let’s reduce those numbers to something humanly readable
df <- df %>% mutate(CROPDMG_TOTAL=CROPDMG_TOTAL/1000000000) %>%
mutate(PROPDMG_TOTAL=PROPDMG_TOTAL/1000000000)
That should result into something more readable:
# let's summarize the total damage to population health and economy:
summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
INJURIES_SUM=sum(INJURIES),
CROPDMG_SUM=sum(CROPDMG_TOTAL),
PROPDMG_SUM=sum(PROPDMG_TOTAL))
kable(summary_table)
| FATALITIES_SUM | INJURIES_SUM | CROPDMG_SUM | PROPDMG_SUM |
|---|---|---|---|
| 15081 | 139710 | 49.02887 | 425.7308 |
The economic costs are given without inflation. At least, in the storm data pdf documentation, there is no mention of inflation. We are going to adjust for inflation by using the US Consumer Price Index and Inflation dataset.
# inflation-adjustment
download.file("https://datahub.io/core/cpi-us/r/cpiai.csv",destfile = "cpi.csv")
cpi <- read_csv("cpi.csv")
cpi <- cpi %>% mutate(year=year(ymd(Date)),month=month(ymd(Date)))
df <- df %>% left_join(cpi,by=c("BGN_YEAR"="year","BGN_MONTH"="month"))
df<- df %>%
mutate(PROPDMG_INFLADJUSTED_TOTAL=PROPDMG_TOTAL/Index) %>%
mutate(CROPDMG_INFLADJUSTED_TOTAL=CROPDMG_TOTAL/Index)
# scale to dollars as of the end of the dataset
df <- df %>%
mutate(PROPDMG_INFLADJUSTED_TOTAL = PROPDMG_INFLADJUSTED_TOTAL * df$Index[nrow(df)]) %>%
mutate(CROPDMG_INFLADJUSTED_TOTAL = CROPDMG_INFLADJUSTED_TOTAL * df$Index[nrow(df)])
# replace the TOTALs by the inflation adjusted values
df <- df %>%
mutate(PROPDMG_TOTAL=PROPDMG_INFLADJUSTED_TOTAL) %>%
mutate(CROPDMG_TOTAL=CROPDMG_INFLADJUSTED_TOTAL) %>%
select(-PROPDMG_INFLADJUSTED_TOTAL,-CROPDMG_INFLADJUSTED_TOTAL, -Index, -Inflation)
summary_table <- df %>% summarize(FATALITIES_SUM=sum(FATALITIES),
INJURIES_SUM=sum(INJURIES),
CROPDMG_SUM=sum(CROPDMG_TOTAL),
PROPDMG_SUM=sum(PROPDMG_TOTAL))
kable(summary_table)
| FATALITIES_SUM | INJURIES_SUM | CROPDMG_SUM | PROPDMG_SUM |
|---|---|---|---|
| 15081 | 139710 | 64.85206 | 583.4175 |
Finally, we are going to drop the month information as we are not going to use it in the following:
# now group by year and EVTYPE
df <- df %>% group_by (BGN_YEAR,
EVTYPE) %>%
summarise_all(funs(sum)) %>%
ungroup %>%
select(-BGN_MONTH)
We are finishing this section with a dataset of dimensions 1374x7 and 485 different event types.
The National Wheather Service Documentation (section 2.1.1) lists 48 event types. We saved the list of these event types to the txt file Eventtypes.csv. The data in the main dataframe df, however, contains 485 different events. Our aim now is to map the event types existing in the dataset to those that are defined in the documentation. First, we are going to set all strings to uppercase and repair some obvious misspellings/abbreviations/semantic mappings.
d <- df$EVTYPE;
d <- toupper(d);
d <- gsub("TSTM|T.?U.?DERSTORM.?|THUNDERESTORM.?|THUNDEERSTORM.?|THUNERSTORM.?","THUNDERSTORM",d,ignore.case=TRUE);
d <- gsub("LIGHTING|LIGNTNING","LIGHTNING",d,ignore.case=T);
d <- gsub("TORNDAO","TORNADO",d,ignore.case=TRUE);
d <- gsub("WINDS|WINS","WIND",d,ignore.case=TRUE);
d <- gsub("SHOWER","RAIN",d,ignore.case=TRUE);
d <- gsub("WINTRY","WINTER",d,ignore.case=TRUE);
d <- gsub("Low temperature","COLD",d,ignore.case=TRUE);
d <- gsub("FLD|FLOODING|FLOODS","FLOOD",d,ignore.case=TRUE);
d <- gsub("AVALANCE","AVALANCHE",d,ignore.case=TRUE);
d <- gsub(" "," ",d,ignore.case=TRUE);
df$EVTYPE <- d;
Good. But we still have 404 event types. One way to map these event types to the 48 given by the documentation is to select the closest possible event type from the documentation to any given event type of the data. We can do that using Rs adist function which calculates the Levenshtein Distance between strings. We then choose the the documentation event type with the lowest distance to the data event type. In case of ties, we simply take the first occurence. At the end, we create a mapping table between data event types and documentation eventtypes.
strEVTYPES <- read_csv("Eventtypes.csv",col_names=FALSE)$X1
d <- unique(d) %>% sort();
a <- adist(d,strEVTYPES,ignore.case = T) %>%
apply(1,which.min) %>%
sapply(function(x){strEVTYPES[x]})
c <- data.frame(EventTypesInData=d,EventTypesInDoc=a)
kable(c %>% arrange(EventTypesInData) %>% head(30))
| EventTypesInData | EventTypesInDoc |
|---|---|
| ? | Hail |
| AGRICULTURAL FREEZE | Frost/Freeze |
| APACHE COUNTY | Waterspout |
| ASTRONOMICAL HIGH TIDE | Astronomical Low Tide |
| ASTRONOMICAL LOW TIDE | Astronomical Low Tide |
| AVALANCHE | Avalanche |
| BEACH EROSION | Heavy Rain |
| BLACK ICE | Avalanche |
| BLIZZARD | Blizzard |
| BLIZZARD/WINTER STORM | Winter Storm |
| BLOWING DUST | Blizzard |
| BLOWING SNOW | Heavy Snow |
| BREAKUP FLOOD | Flash Flood |
| BRUSH FIRE | Wildfire |
| COASTAL EROSION | Coastal Flood |
| COASTAL FLOOD | Coastal Flood |
| COASTAL FLOOD/EROSION | Coastal Flood |
| COASTAL STORM | Coastal Flood |
| COASTAL SURGE | Coastal Flood |
| COASTALSTORM | Coastal Flood |
| COLD | Flood |
| COLD AIR TORNADO | Tornado |
| COLD AND SNOW | Heavy Snow |
| COLD AND WET CONDITIONS | Cold/Wind Chill |
| COLD TEMPERATURE | Cold/Wind Chill |
| COLD WAVE | Wildfire |
| COLD WEATHER | Winter Weather |
| COLD/WIND | High Wind |
| COLD/WIND CHILL | Cold/Wind Chill |
| COOL AND WET | Avalanche |
We see that the result is kind of mixed. Some values seem to be matched kind of ok, some are not right at all. We will therefore resort to a manual mapping of event types.
mappingTable <- data.frame(EventTypeData=d,
EventTypeNew=rep(NA,length(d)));
selector <- grepl("TORNADO",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "TORNADO"
selector <- grepl("HURRICANE|TYPHOON",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "HURRICANE"
selector <- grepl("THUNDERSTORM",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "THUNDERSTORM"
selector <- grepl("FLOOD|FLD|FLOODING|FLOODS",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "FLOOD"
selector <- grepl("DROUGHT",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "DROUGHT"
selector <- grepl("HEAT|.*hyperthermia.*|unseasonably warm.*|.*warm weather.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "HEAT"
selector <- grepl("COOL|COLD|.*hypothermia.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "COLD"
selector <- grepl("STORM",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "STORM"
selector <- grepl("WIND|NON.*THUNDERSTORM.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "HIGH WIND"
selector <- grepl("HAIL",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "HAIL"
selector <- grepl("RAIN|PRECIP.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "RAIN"
selector <- grepl("ICE|ICY",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "SNOW/ICE"
selector <- grepl("SNOW",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "SNOW/ICE"
selector <- grepl("FREEZ.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "FROST/FREEZE"
selector <- grepl("FROST.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "FROST/FREEZE"
selector <- grepl("TSUNAMI",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "TSUNAMI"
selector <- grepl("FIRE",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "FIRE"
selector <- grepl("WINTER",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "WINTER"
selector <- grepl("LIGHTNING",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "LIGHTNING"
selector <- grepl("FOG",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "FOG"
selector <- grepl("BLIZZARD",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "BLIZZARD"
selector <- grepl("AVALANCHE",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "AVALANCHE"
selector <- grepl("rip current.*",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "RIP CURRENT"
selector <- grepl("SURF",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "HIGH SURF"
selector <- grepl("LANDSLIDE",d,ignore.case=TRUE) & is.na(mappingTable$EventTypeNew);
mappingTable$EventTypeNew[selector] <- "LANDSLIDE"
kable(mappingTable %>% arrange(EventTypeData) %>% head(30))
| EventTypeData | EventTypeNew |
|---|---|
| ? | NA |
| AGRICULTURAL FREEZE | FROST/FREEZE |
| APACHE COUNTY | NA |
| ASTRONOMICAL HIGH TIDE | NA |
| ASTRONOMICAL LOW TIDE | NA |
| AVALANCHE | AVALANCHE |
| BEACH EROSION | NA |
| BLACK ICE | SNOW/ICE |
| BLIZZARD | BLIZZARD |
| BLIZZARD/WINTER STORM | STORM |
| BLOWING DUST | NA |
| BLOWING SNOW | SNOW/ICE |
| BREAKUP FLOOD | FLOOD |
| BRUSH FIRE | FIRE |
| COASTAL EROSION | NA |
| COASTAL FLOOD | FLOOD |
| COASTAL FLOOD/EROSION | FLOOD |
| COASTAL STORM | STORM |
| COASTAL SURGE | NA |
| COASTALSTORM | STORM |
| COLD | COLD |
| COLD AIR TORNADO | TORNADO |
| COLD AND SNOW | COLD |
| COLD AND WET CONDITIONS | COLD |
| COLD TEMPERATURE | COLD |
| COLD WAVE | COLD |
| COLD WEATHER | COLD |
| COLD/WIND | COLD |
| COLD/WIND CHILL | COLD |
| COOL AND WET | COLD |
That seems better. There are still 13.12% data event types unmapped. Let’s have a look what percentage on population health and economic damage those represent.
# apply the mappingTable
df2 <- df %>% left_join(mappingTable,
by=c("EVTYPE"="EventTypeData"))
ddf <- df2 %>% mutate(chooser=is.na(EventTypeNew)) %>% group_by(chooser) %>%
summarize(sumFAT=sum(FATALITIES),
sumINJ=sum(INJURIES),
sumDMGCROP=sum(CROPDMG_TOTAL),
sumDMGPROP=sum(PROPDMG_TOTAL));
kable(ddf)
| chooser | sumFAT | sumINJ | sumDMGCROP | sumDMGPROP |
|---|---|---|---|---|
| FALSE | 15028 | 139362 | 64.6399213 | 583.3668594 |
| TRUE | 53 | 348 | 0.2121368 | 0.0506244 |
Now, that seems reasonable. We have 24 event types, of which one contains all the unmatched data event types, but we succeeded to have more than 99% in each damage category within our mapped 23 damage categories. We think we can continue with the resulting dataset.
df <- df2 %>% mutate(Event_Type=EventTypeNew) %>%
select(-EVTYPE,EventTypeNew) %>%
group_by (BGN_YEAR, Event_Type) %>%
summarize(FATALITIES=sum(FATALITIES),
INJURIES=sum(INJURIES),
CROPDMG=sum(CROPDMG_TOTAL),
PROPDMG=sum(PROPDMG_TOTAL)) %>%
ungroup()
We have succeded to reduce the dataset to an 488x6 dataset still containing the essentials of the storm data. With this, we conclude the data preprocessing and will finally do some plots.
Let’s first analyse which event type has the strongest impact on population health:
# prepare the dataset for plotting
df <- df %>% gather(Damage_Type, Damage_Value, FATALITIES:PROPDMG,factor_key = TRUE)
df <- df %>%
arrange(BGN_YEAR,Event_Type,Damage_Type,Damage_Value)
df <- df %>% mutate(Damage_Category=ifelse(Damage_Type %in% c("FATALITIES","INJURIES"),
"population Health",
"Economic Damage"))
df <- df %>% replace_na(list(Event_Type = "Other")) %>%
mutate(Damage_Category=as.factor(Damage_Category)) %>%
mutate(Event_Type=ifelse(is.na(Event_Type),"Other",Event_Type)) %>%
mutate(Event_Type=as.factor(Event_Type))
# the dataset for this specific plot
df_Plot <- df %>%
subset(Damage_Category=="population Health") %>%
group_by (Event_Type,Damage_Type) %>%
summarise("N"=sum(Damage_Value)) %>%
ungroup() %>%
mutate (N2=ifelse(Damage_Type=="FATALITIES",-N,N/10)) %>%
left_join(df %>%
filter (Damage_Type=="FATALITIES") %>%
select(BGN_YEAR,Event_Type,Damage_Value) %>%
mutate(FAT=Damage_Value) %>%
select (-Damage_Value) %>%
group_by(Event_Type) %>%
summarize(FAT=sum(FAT)) %>%
ungroup()) %>%
arrange(desc(FAT)) %>%
head(n=20)
brks <- seq(-7500, 12500, 2500)
lbls <- (as.character(c(seq(7500,0,-2500),seq(25000,125000,25000))))
# the plot
df_Plot %>%
ggplot(aes(x=reorder(Event_Type,+FAT),
y=N2,
fill=Damage_Type)) +
geom_bar(stat="identity",width=0.6) +
geom_text(data=df_Plot %>%
filter(Damage_Type=="FATALITIES") %>%
arrange(desc(N)),
aes(label = N), hjust = 1., size=3,color = "#000000") +
geom_text(data=df_Plot %>%
filter(Damage_Type=="INJURIES") %>%
arrange(desc(N)),
aes(label = N), hjust = 0., size=3,color = "#000000") +
scale_y_continuous(breaks = brks,
labels=lbls,
limits=c(-7500,12500)) +
scale_fill_manual(values = c("red", "blue")) +
labs(x="Event Type",
y="Number of affected people",
title="Top 10 population health impacting event types",
subtitle="Ordered by decreasing number of fatalities") +
guides(fill=guide_legend(title="New Legend Title")) +
theme(legend.title=element_blank()) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust=0.5)) +
coord_flip()
We can see that Tornados, Heat and Floods have the strongest impact on population Health, Both for Fatalities and Injuries. The most devastating effects on population health are concentrated on just a few event types, the top three cover already more than 2/3 of the total fatalities for all event types. A question that arises is: what is the percentage of total Casualties covered by these Eventtypes? What is the percentage of the total damage in other damage categories (CROP, PROP, ECONOMY=CROP+PROP), INJURIES) covered by these event types?. Let’s quantify that:
# total damage
df_total_damage <- df %>% group_by(Damage_Type) %>% summarise(Total=sum(Damage_Value))
myfunc_createOrderedDFOfEventTypes <- function(DT,df){
if (DT == "ECONOMY"){
DT2 <- c("CROPDMG","PROPDMG")
}
else{
DT2 <- c(DT)
}
df %>%
filter(Damage_Type %in% DT2) %>%
group_by (Event_Type) %>%
summarize(Damage_Value=sum(Damage_Value)) %>%
arrange(desc(Damage_Value)) %>%
ungroup()
}
# Event Types ordered wrt all Damage Types, all years
DTs <- toupper(c("Fatalities","Injuries","PropDmg","PropDmg","Economy"))
l_OrderedEventTypes <- lapply(DTs,function(DT){myfunc_createOrderedDFOfEventTypes(DT,df)})
names(l_OrderedEventTypes) <- DTs
myfunc_createSummaryDF <- function(df_total_damage,df,l_OrderedEventTypes,n,DT){
df_total_damage %>% left_join(
df %>% filter (Event_Type %in% (
l_OrderedEventTypes[[DT]] %>% head(n))$Event_Type) %>%
group_by (Damage_Type) %>%
summarize(Damage_Value=sum(Damage_Value))
) %>%
mutate(!!sprintf("Percentage damage",n,"CROP"):=Damage_Value/Total*100)
}
# impact of top n eventtypes
n <- 3
DT <- "FATALITIES"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
kable_styling(full_width = F)
| Damage_Type | Total | Damage_Value | Percentage damage |
|---|---|---|---|
| FATALITIES | 15081 | 10325 | 68.46363 |
| INJURIES | 139710 | 108512 | 77.66946 |
| CROPDMG | 65 | 18 | 28.30610 |
| PROPDMG | 583 | 340 | 58.29518 |
OK, so the top three event types for Fatalities cover more than 2/3 of all Fatalities and all Injuries, more than half of the Property damage, and still more than 25% of the Crop Damage. We can confidently say that Tornados, Heat and Flood have the most devastating effects on population Health while they have a very strong impact on the economy.
Now, let’s do the same analysis for economic consequences. In this case, we will consider the strongest impacts on the economy, which we consider to be PROPDMG+CROPDMG. Let’s do the plot:
df$Damage_Type <- factor(df$Damage_Type, levels(df$Damage_Type)[c(1,2,4,3)])
# the dataset for this specific plot
df_Plot <- df %>%
subset(Damage_Category=="Economic Damage") %>%
group_by (Event_Type,Damage_Type) %>%
summarise("N"=sum(Damage_Value)) %>%
ungroup() %>%
mutate (N2=ifelse(Damage_Type=="PROPDMG",-N,N*10)) %>%
left_join(df %>%
filter (Damage_Type %in% c("PROPDMG","CROPDMG")) %>%
select(BGN_YEAR,Event_Type,Damage_Value) %>%
mutate(ECONOMICDMG=Damage_Value) %>%
select (-Damage_Value) %>%
group_by(Event_Type) %>%
summarize(ECONOMICDMG=sum(ECONOMICDMG)) %>%
ungroup()) %>%
arrange(desc(ECONOMICDMG)) %>%
head(n=20)
brks <- seq(-250, 200, 50)
lbls <- (as.character(c(seq(250,0,-50),seq(50,200,50))))
# the plot
df_Plot %>%
ggplot(aes(x=reorder(Event_Type,+N),
y=N2,
fill=Damage_Type)) +
geom_bar(stat="identity",width=0.6) +
geom_text(data=df_Plot %>%
filter(Damage_Type=="PROPDMG") %>%
arrange(desc(N)),
aes(label = round(N,1)), hjust = 1., size=3,color = "#000000") +
geom_text(data=df_Plot %>%
filter(Damage_Type=="CROPDMG") %>%
arrange(desc(N)),
aes(label = round(N,1)), hjust = 0., size=3,color = "#000000") +
scale_y_continuous(breaks = brks,
labels=lbls,
limits=c(-250,200)) +
scale_fill_manual(values = c("red", "blue"),
labels = c("Property Damage","Crop Damage")) +
labs(x="Event Type",
y="Damage in Billion Dollars, inflation adjusted",
title="Top 10 event types impacting the economy",
subtitle="Ordered by decreasing value of economic consequences") +
guides(fill=guide_legend(title="New Legend Title")) +
theme(legend.title=element_blank()) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust=0.5)) +
coord_flip()
It appears that the most devastating event types wrt economic loss are Floods, Tornados, and Hurricanes. The first two (Tornados and Floods) are also in the top 3 of the most devasttating event types for population health. The third one, Hurricanes, has a strong impact on the economy, but a low impact on population health (outside the top 10 event types wrt population health). Droughts are a special case as they have the strongest impact of all event types on Crop damage, but a much lower impact on Property and population health (outside the top 10).
Let’s again have a look on the impact of the top 3 event types for economic consequences wrt the other damage categories:
n <- 3
DT <- "ECONOMY"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
kable_styling(full_width = F)
| Damage_Type | Total | Damage_Value | Percentage damage |
|---|---|---|---|
| FATALITIES | 15081 | 7289 | 48.33234 |
| INJURIES | 139710 | 100617 | 72.01847 |
| CROPDMG | 65 | 24 | 37.22013 |
| PROPDMG | 583 | 442 | 75.76431 |
It is apparent that in this case, the top 3 event types do not explain as much of the total damage as they did in the case of the consequences on population health. The economic consequences are more spread out among the event types when compared to the consequences on population health. Let’s consider the top 10 event types instead of the top 3:
n <- 10
DT <- "ECONOMY"
ddf <- myfunc_createSummaryDF(df_total_damage,df,l_OrderedEventTypes,n,DT)
ddf$Total <- round(ddf$Total)
ddf$Damage_Value <- round(ddf$Damage_Value)
kable(ddf, "html",
caption=sprintf("Impact of the top %d event types wrt %s on total values",n,DT)) %>%
kable_styling(full_width = F)
| Damage_Type | Total | Damage_Value | Percentage damage |
|---|---|---|---|
| FATALITIES | 15081 | 9137 | 60.58617 |
| INJURIES | 139710 | 119637 | 85.63238 |
| CROPDMG | 65 | 59 | 90.32327 |
| PROPDMG | 583 | 579 | 99.19829 |
So the top 10 event types cover far more of the total value wrt damage categories not including Fatalities (more than 85%), but wrt the impact on Fatalities they still only explain 60%.
So we can conclude the following: