1 Synopsis
2 Data Processing
- 2.1 Correcting propdmgexp and cropdmgexp
- 2.2 Fixing evtype
3 Results
- 3.1 Across the United States, which types of events are most harmful to population health?
  - 3.1.1 Hurricanes?
  - 3.1.2 Why is there both heat and excessive heat?
- 3.2 Across the United States, which types of events have the greatest economic consequences?

Most recent update: 2022-01-15

1 Synopsis

This report is a final project for the course Reproducible Research by Johns Hopkins University.

In this analysis we will report on the top weather-related causes of human death & injury and property & crop damage. To calculate these sums we will use the NOAA’s data set of weather events and associated damage for events in the USA between 1950 and November 2011. We spend substantial effort correcting manual data entry inconsistencies in the evtype variable, and note that damage caused by hurricanes is split up into multiple categories, for instance flood and hail.

We find the most damaging weather event to human health is the tornado, and the most damaging weather events to property and crops are flooding and droughts, respectively.

2 Data Processing

We’ll load libraries first.

# data.table::fread() is also required. We won't load the entire
# package because its namespace overlaps some other functions we use.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))  # table formatting tools

# Set global table styling options here by overriding kable()
kable <- function(data, ...) {
    knitr::kable(data, format.args = list(big.mark = ","), ...) %>% 
        kable_styling(full_width=FALSE, position="left")
}

Next we load the data and perform some basic data cleaning:

rename two variables with cumbersome names
remove two variables which contain no information
- county_end and countyendn are all 0 and NA, respectively.
standardize the values of evtype to lowercase with no extra spaces.
- to correct some probable manual data entry errors

df <- as_tibble(data.table::fread(file = "repdata_data_StormData.csv.bz2"))
df <- rename(df, state_num = STATE__, longitude_e = LONGITUDE_)
names(df) <- tolower(names(df))
# these variables are all 0 and all NA, respectively
df <- select(df, -c(county_end, countyendn))

# standardize `evtype` to lowercase, removing spaces at beginning and end
# and removing extra internal spaces
df$evtype <- tolower(str_squish(df$evtype))

2.1 Correcting `propdmgexp` and `cropdmgexp`

In the raw data, damage amounts are represented in an unusual way. They are expressed in scientific notation, with the base and exponent stored in separate variables.

The exponents, propdmgexp and cropdmgexp, are supposed to represent a power of 10, which is multiplied by propdmg or cropdmg respectively to express a final damage amount. However, they actually contain unusual character values:

sort(union(unique(df$propdmgexp), unique(df$cropdmgexp)))

##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "k" "K" "m"
## [20] "M"

After some investigation, it appears varying data entry standards or data entry error has resulted in this state of affairs. A power of 10 will be assigned to replace each of these symbols. [For a more complete discussion, see this document.]

empty string: 0
+ or ?: 1
0-9: 10
hH: 100 “hundred”
kK: 1000 “kilo”
mM: 10^6 “million”
bB: 10^9 “billion”
-: NA

# Perform the substitution described above
old <- c("-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "B", "H", "K", "M", "")
new <- c(NA,   1,   1,   10,  10,  10,  10,  10,  10,  10,  10,  10, 1e9, 1e2, 1e3, 1e6, 0)

ctmp <- toupper(df$cropdmgexp)
ptmp <- toupper(df$propdmgexp)
cnew <- rep(NA_integer_, length(ctmp)) # new vectors will be integers
pnew <- rep(NA_integer_, length(ptmp))

for (i in 1:length(old)) {
    cnew[ctmp == old[[i]]] <- new[[i]]
    pnew[ptmp == old[[i]]] <- new[[i]]
}

df$cropdmgexp <- cnew
df$propdmgexp <- pnew
rm(ctmp, ptmp, cnew, pnew, old, new, i)

2.2 Fixing `evtype`

evtype may be the most important variable in our dataset. It is the variable that categorizes each event as tornado, flood and so forth. Later we’ll use evtype to aggregate human and financial damage for each type of weather event and it’s important that they reflect the categories we think they do.

In the NOAA documentation there are 48 official evtype values, corresponding to the various weather events they wish to record data for.

In the dataset however, 883 unique values exist after standardizing to lowercase and removing extra whitespace. An examination of the values shows this to have probably resulted from manual data entry. For example, here we list types relating to blizzards.

df %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>% 
    filter(str_detect(evtype, "blizzard")) %>% 
    head(n=7) %>% 
    kable(caption = 'Counts of blizzard-related evtypes')

Counts of blizzard-related evtypes
evtype	n
blizzard	2,719
high wind/blizzard	6
heavy snow/blizzard	3
blizzard and extreme wind chil	2
blizzard/heavy snow	2
ground blizzard	2
blizzard and heavy snow	1

We can see that "blizzard" labels most of the observations, with a small number of other observations given similar-sounding labels. These observations should probably be a part of "blizzard". With over 800 types in the dataset, We’d like to know if this example represents a systematic problem.

2.2.1 What proportion of `evtypes` are problematic?

We constructed a list of official evtypes from NOAA documentation and stored it in evtypes.txt. We wish to know what proportion of records have a non-standard evtype value?

types <- tolower(readLines("evtypes.txt")) # a vector of standard evtypes
df %>% filter(! evtype %in% types) %>% nrow() / nrow(df)

## [1] 0.2958472

About 30% of the data is using some other label for evtype. This is a significant number and could certainly skew our analysis. e.g. if one major event type were split into many smaller types because of typos, it might no longer appear to be a major cause of damage.

2.2.2 How many non-standard `evtypes` do we need to fix?

What we don’t know yet is how these 30% of records are distributed among non-standard evtypes. i.e. Are they evenly distributed amongst the 800 or so incorrect evtypes (a worst-case scenario) or do a small number of strings represent the majority of these problematic observations?

We’ll group non-standard observations by their evtype and count the observations. We’ll also sort these counts and calculate a cumulative proportion of the observations.

df %>% 
    filter(! evtype %in% types) %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>%  
    mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>% 
    slice_max(n, n=4) %>% kable(caption = "Top four errors in evtype")

Top four errors in evtype
evtype	n	proportion	cumulative
tstm wind	219,946	0.8239468	0.8239468
thunderstorm winds	20,850	0.0781069	0.9020536
marine tstm wind	6,175	0.0231324	0.9251860
urban/sml stream fld	3,392	0.0127069	0.9378929

The four most common errors represent almost 94% of all errors.

Let’s pick a nice round number of observations as a cutoff value. We’ll count how many non-standard evtype strings label 100 or more observations in the dataset. As a reminder, there are 902297 observations in total.

fixtypes <- df %>% 
    filter(! evtype %in% types) %>%
    count(evtype) %>% 
    arrange(desc(n)) %>% 
    filter(n >= 100) %>% 
    pull(evtype)
length(fixtypes)

## [1] 32

This is good news, we won’t have to fix over 800 errors. Fixing the 32 most common should ensure a much more accurate analysis. Let’s calculate what proportion of observations will be correct, if we fix evtypes with 100 or more observations.

df %>% 
    filter(! evtype %in% types) %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>% 
    mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>% 
    filter(n > 99, n <=101) %>% kable()

evtype	n	proportion	cumulative
moderate snowfall	101	0.0003784	0.9843337

98.4% of our problematic records will be corrected after fixing the top 32. Let’s visualize the situation by plotting the cumulative proportion of correct evtypes.

x <- df %>% 
    filter(! evtype %in% types) %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>%  
    mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>% 
    head(32) 
tmp <- x$cumulative
names(tmp) <- x$evtype
par(mar = c(9,4,2,2), cex.axis = .8)
barplot(tmp, las=2, ylim=c(0,1),
         main="Cumulative proportion correct, after fixing evtype errors")
abline(h=1, lty=2, col='dodgerblue2', lwd=2.5)

Fixing tstm wind alone will correct over 80% of the records, and then we see rapidly diminishing returns.

Since evtype errors represent about 30% of all records, correcting these errors to about 98.4% accuracy (the top 32 errors) will result in an overall evtype accuracy of about (1 - 0.3 * (1 - .984)) * 100 ~= 99.5% for the entire dataset. Further improvements would be via vastly diminished returns, so we’ll consider this an acceptable improvement over the original data at about 70% correct.

2.2.3 Fixing the selected `evtypes`

We will work through these 32 types systematically and assign them the correct evtype, using the NOAA documentation and some common sense to place them in the correct category.

The rest of the non-standard labels we will leave unchanged, assuming they have too small an effect on the analysis to matter. Let’s first look at the labels we’ll fix.

kable(matrix(fixtypes, nrow = 8), caption = "The 32 most important evtypes to correct")

The 32 most important evtypes to correct
tstm wind	flash flooding	storm surge	light snow
thunderstorm winds	extreme cold	freezing rain	hurricane
marine tstm wind	flood/flash flood	urban flood	river flood
urban/sml stream fld	snow	heavy surf/high surf	record warmth
high winds	landslide	extreme windchill	unseasonably warm
wild/forest fire	fog	strong winds	flooding
winter weather/mix	wind	dry microburst	astronomical high tide
tstm wind/hail	rip currents	coastal flooding	moderate snowfall

Now we’ll fix these incorrect labels by reassigning the observation to the correct evtype.

# map official evtypes to a vector of some incorrect names for that type
fixlist <- list(
    "thunderstorm wind" = c("tstm wind", "thunderstorm winds", "tstm wind/hail", 
                            "wind", "dry microburst"),
    "marine thunderstorm wind" = c("marine tstm wind"),
    "flood" = c("urban/sml stream fld", "flood/flash flood", "urban flood",
                "river flood", "flooding"),
    "flash flood" = c("flash flooding"),
    "coastal flood" = c("coastal flooding"),
    "high wind" = c("high winds"),
    "strong wind" = c("strong winds"),
    "wildfire" = c("wild/forest fire"),
    "winter weather" = c("winter weather/mix"),
    "extreme cold/wind chill" = c("extreme cold", "extreme windchill"),
    "heavy snow" = c("snow", "light snow", "moderate snowfall"),
    "debris flow" = c("landslide"),
    "dense fog" = c("fog"),
    "rip current" = c("rip currents"),
    "excessive heat" = c("unseasonably warm", "record warmth"),
    # from docs: "..caused by any combination of...high astronomical tide..."
    "storm surge/tide" = c("storm surge", "astronomical high tide"),  
    "high surf" = c("heavy surf/high surf"),
    "winter weather" = c("freezing rain"),
    "hurricane (typhoon)" = c("hurricane")
)

tmp <- df$evtype

# replace all the bad evtypes with official evtypes
for (officialtype in names(fixlist)) {
    for (badtype in fixlist[[officialtype]]) {
        tmp[grep(pattern = badtype, x = tmp)] <- officialtype
    }
}
df$evtype <- tmp
rm(tmp)

2.2.3.1 How well did the corrections work?

Previously, almost 30% of records had an incorrect evtype. Let’s recalculate that proportion.

df %>% filter(! evtype %in% types) %>% nrow() / nrow(df)

## [1] 0.003188529

Now, only 0.3% of records have some unusual evtype from data entry errors, or in other words evtype is now almost 99.7% correct. Earlier we estimated we’d see 99.5% correct, so this is reasonable.

2.2.3.2 A final test: dollar amounts

Let’s make one more test. It’s possible these small number of observations are actually large in terms of human or financial damage. Let us compare the sum of fatalities, injuries, property damage, and crop damage of the remaining uncorrected records to the whole. If our suppositions are correct, the records with non-standard evtypes will be only a small percentage of the total damage.

results <- df %>% 
    mutate(property = propdmg*propdmgexp, crops = cropdmg*cropdmgexp) %>% 
    group_by("standard_evtypes" = evtype %in% types) %>% 
    summarise(across(c(fatalities, injuries, property, crops),
                     ~sum(., na.rm=TRUE))) %>% 
    t()
results <- results[-1, ]
colnames(results) <- c("sum.non.standard", "sum.standard")
as_tibble(results) %>% 
    mutate(non.standard.proportion = sum.non.standard / (sum.non.standard + sum.standard),
           class = rownames(results)) %>% 
    select(class, everything()) %>% 
    kable(caption = "Proportions of total damage represented by non-standard evtypes")

Proportions of total damage represented by non-standard evtypes
class	sum.non.standard	sum.standard	non.standard.proportion
fatalities	500	14,645	0.0330142
injuries	1,718	138,810	0.0122253
property	5,921,397,600	421,397,322,440	0.0138571
crops	1,289,044,280	47,815,150,230	0.0262512

With these non-standard cases making up at most 3.3% of the total damage, the dataset is in much better condition than it was in raw form. We’ll continue the analysis with this imperfect but much-improved dataset, assuming that this small proportion of error won’t significantly skew our results.

3 Results

3.1 Across the United States, which types of events are most harmful to population health?

Here we sum the fatalities and injuries across evtype, select the ten largest sums, and then plot the results.

top_fatalities <- df %>% 
    group_by(evtype) %>% 
    summarise(fatalities = sum(fatalities, na.rm = TRUE)) %>% 
    arrange(desc(fatalities)) %>% 
    slice_head(n = 10)

top_injuries <- df %>% 
    group_by(evtype) %>% 
    summarise(injuries = sum(injuries, na.rm = TRUE)) %>% 
    arrange(desc(injuries)) %>% 
    slice_head(n = 10)

par(mfrow = c(1,2), mar=c(c(8, 4, 4, 2) + 0.1), cex.axis=.9)
barplot(top_fatalities$fatalities, names.arg = top_fatalities$evtype, las=2,
        main="Fatalities: top causes")
mtext(text = "Data source: U.S. NOAA storm database, 1950-2011", outer = T,
      line = -25, cex = .7)
barplot(top_injuries$injuries, names.arg = top_injuries$evtype, las=2,
        main="Injuries: top causes")

In terms of both fatalities and injuries, tornadoes are the most destructive events to human health in this dataset.

kable(list(data.frame(rank = 1:10), top_fatalities, data.frame(),data.frame(), top_injuries),
      caption = "Top causes: fatalities & injuries")

Top causes: fatalities & injuries

rank
1
2
3
4
5
6
7
8
9
10

evtype	fatalities
tornado	5,633
excessive heat	1,943
thunderstorm wind	1,454
flash flood	978
heat	937
lightning	816
rip current	577
flood	554
avalanche	224
winter storm	206

evtype	injuries
tornado	91,346
thunderstorm wind	11,526
flood	6,902
excessive heat	6,542
lightning	5,230
heat	2,100
ice storm	1,975
flash flood	1,777
wildfire	1,456
hail	1,361

3.1.1 Hurricanes?

There appears to be something missing in the above graphs. Knowing something of destructive US weather events one must ask “Where are the hurricanes?”

There is an important note in the FAQ:

The fatalities, injuries, and damage amounts appearing in tropical cyclone events are attributed only to wind damage experienced in the coastal counties/parishes listed. Other tropical cyclone related events such as tornadoes and flooding are listed within their separate event types.

i.e. the deaths and damage from hurricanes have deliberately been split up into multiple evtypes and will show up as part of floods, lightning, wind, etc. Hurricane-spawned tornadoes are also included in the tornado events graphed above. From the point of view of this dataset, hurricanes may be understood as a ‘meta-event’ that causes other events.

3.1.2 Why is there both `heat` and `excessive heat`?

It is also worth noting that some similar events have different official evtypes, e.g. excessive heat and heat, both of which show up in the graphs above. We examined the NOAA documentation that accompanies the dataset and verified that they are indeed separate, official categories. We will leave the situation as is, assuming the NOAA has good reason to organize their data thus.

3.2 Across the United States, which types of events have the greatest economic consequences?

We proceed as earlier, summing property and crop damage across all evtypes.

top_property_dmg <- df %>% 
    group_by(evtype) %>% 
    summarise(property = sum(propdmg * propdmgexp, na.rm = TRUE)) %>% 
    arrange(desc(property)) %>% 
    slice_head(n = 10)
top_crop_dmg <- df %>% 
    group_by(evtype) %>% 
    summarise(crop = sum(cropdmg * cropdmgexp, na.rm = TRUE)) %>% 
    arrange(desc(crop)) %>% 
    slice_head(n = 10)

par(mfrow=c(1,2), mar=c(c(9, 5, 4, 2) + 0.1), cex.axis=.9, mgp = c(4, 1, 0))
barplot(top_property_dmg$property, names.arg = top_property_dmg$evtype, las=2,
        main="Property damage: top causes", ylab = "US Dollars")
mtext(text = "Data source: U.S. NOAA storm database, 1950-2011", outer = T,
      line = -25, cex = .7)
barplot(top_crop_dmg$crop, names.arg = top_crop_dmg$evtype, las=2,
        main="Crop damage: top causes")

kable(list(data.frame(rank = 1:10), top_property_dmg, data.frame(),data.frame(), 
           top_crop_dmg),
      caption = "Top causes: property & crop damage")

Top causes: property & crop damage

rank
1
2
3
4
5
6
7
8
9
10

evtype	property
flood	150,751,456,324
hurricane (typhoon)	84,656,180,010
tornado	56,937,162,897
storm surge/tide	47,974,149,000
thunderstorm wind	17,749,372,097
flash flood	16,140,865,011
hail	15,732,269,877
wildfire	7,766,963,500
tropical storm	7,703,890,550
winter storm	6,688,497,260

evtype	crop
drought	13,972,566,000
flood	10,853,820,100
hurricane (typhoon)	5,505,292,800
ice storm	5,022,113,500
hail	3,025,954,650
thunderstorm wind	2,159,320,250
flash flood	1,421,317,100
extreme cold/wind chill	1,312,973,000
frost/freeze	1,094,186,000
heavy rain	733,399,800

As mentioned in the previous section, the amounts for hurricane (typhoon) are only for wind damage on coastal communities. Hurricane-caused flooding, etc. are grouped into separate evtypes.

author website: andrewluyt.github.io

To aid reproducibility, here is the session environment at time of publishing.

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
## 
## locale:
##  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4 lubridate_1.8.0  forcats_0.5.1    stringr_1.4.0   
##  [5] dplyr_1.0.7      purrr_0.3.4      readr_2.1.1      tidyr_1.1.4     
##  [9] tibble_3.1.6     ggplot2_3.3.5    tidyverse_1.3.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        svglite_2.0.0     assertthat_0.2.1  digest_0.6.29    
##  [5] utf8_1.2.2        R6_2.5.1          cellranger_1.1.0  backports_1.4.1  
##  [9] reprex_2.0.1      evaluate_0.14     httr_1.4.2        highr_0.9        
## [13] pillar_1.6.4      rlang_0.4.12      readxl_1.3.1      rstudioapi_0.13  
## [17] jquerylib_0.1.4   rmarkdown_2.11    webshot_0.5.2     munsell_0.5.0    
## [21] broom_0.7.11      compiler_4.1.2    modelr_0.1.8      xfun_0.29        
## [25] pkgconfig_2.0.3   systemfonts_1.0.3 htmltools_0.5.2   tidyselect_1.1.1 
## [29] codetools_0.2-18  fansi_0.5.0       viridisLite_0.4.0 crayon_1.4.2     
## [33] tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3       grid_4.1.2       
## [37] jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1   DBI_1.1.2        
## [41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6    
## [45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1   
## [49] vctrs_0.3.8       tools_4.1.2       glue_1.6.0        hms_1.1.1        
## [53] fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2  rvest_1.0.2      
## [57] knitr_1.37        haven_2.4.3

Harmful Effects of Severe Weather, 1950 - 2011

Andrew Luyt

1 Synopsis

2 Data Processing

2.1 Correcting `propdmgexp` and `cropdmgexp`

2.2 Fixing `evtype`

2.2.1 What proportion of `evtypes` are problematic?

2.2.2 How many non-standard `evtypes` do we need to fix?

2.2.3 Fixing the selected `evtypes`

2.2.3.1 How well did the corrections work?

2.2.3.2 A final test: dollar amounts

3 Results

3.1 Across the United States, which types of events are most harmful to population health?

3.1.1 Hurricanes?

3.1.2 Why is there both `heat` and `excessive heat`?

3.2 Across the United States, which types of events have the greatest economic consequences?

Harmful Effects of Severe Weather, 1950 - 2011

Andrew Luyt

1 Synopsis

2 Data Processing

2.1 Correcting propdmgexp and cropdmgexp

2.2 Fixing evtype

2.2.1 What proportion of evtypes are problematic?

2.2.2 How many non-standard evtypes do we need to fix?

2.2.3 Fixing the selected evtypes

2.2.3.1 How well did the corrections work?

2.2.3.2 A final test: dollar amounts

3 Results

3.1 Across the United States, which types of events are most harmful to population health?

3.1.1 Hurricanes?

3.1.2 Why is there both heat and excessive heat?

3.2 Across the United States, which types of events have the greatest economic consequences?

2.1 Correcting `propdmgexp` and `cropdmgexp`

2.2 Fixing `evtype`

2.2.1 What proportion of `evtypes` are problematic?

2.2.2 How many non-standard `evtypes` do we need to fix?

2.2.3 Fixing the selected `evtypes`

3.1.2 Why is there both `heat` and `excessive heat`?