Most recent update: 2022-01-15

1 Synopsis

This report is a final project for the course Reproducible Research by Johns Hopkins University.

In this analysis we will report on the top weather-related causes of human death & injury and property & crop damage. To calculate these sums we will use the NOAA’s data set of weather events and associated damage for events in the USA between 1950 and November 2011. We spend substantial effort correcting manual data entry inconsistencies in the evtype variable, and note that damage caused by hurricanes is split up into multiple categories, for instance flood and hail.

We find the most damaging weather event to human health is the tornado, and the most damaging weather events to property and crops are flooding and droughts, respectively.

2 Data Processing

We’ll load libraries first.

# data.table::fread() is also required. We won't load the entire
# package because its namespace overlaps some other functions we use.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))  # table formatting tools

# Set global table styling options here by overriding kable()
kable <- function(data, ...) {
    knitr::kable(data, format.args = list(big.mark = ","), ...) %>% 
        kable_styling(full_width=FALSE, position="left")
}

Next we load the data and perform some basic data cleaning:

df <- as_tibble(data.table::fread(file = "repdata_data_StormData.csv.bz2"))
df <- rename(df, state_num = STATE__, longitude_e = LONGITUDE_)
names(df) <- tolower(names(df))
# these variables are all 0 and all NA, respectively
df <- select(df, -c(county_end, countyendn))

# standardize `evtype` to lowercase, removing spaces at beginning and end
# and removing extra internal spaces
df$evtype <- tolower(str_squish(df$evtype))

2.1 Correcting propdmgexp and cropdmgexp

In the raw data, damage amounts are represented in an unusual way. They are expressed in scientific notation, with the base and exponent stored in separate variables.

The exponents, propdmgexp and cropdmgexp, are supposed to represent a power of 10, which is multiplied by propdmg or cropdmg respectively to express a final damage amount. However, they actually contain unusual character values:

sort(union(unique(df$propdmgexp), unique(df$cropdmgexp)))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "k" "K" "m"
## [20] "M"

After some investigation, it appears varying data entry standards or data entry error has resulted in this state of affairs. A power of 10 will be assigned to replace each of these symbols. [For a more complete discussion, see this document.]

  • empty string: 0
  • + or ?: 1
  • 0-9: 10
  • hH: 100 “hundred”
  • kK: 1000 “kilo”
  • mM: 10^6 “million”
  • bB: 10^9 “billion”
  • -: NA
# Perform the substitution described above
old <- c("-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "B", "H", "K", "M", "")
new <- c(NA,   1,   1,   10,  10,  10,  10,  10,  10,  10,  10,  10, 1e9, 1e2, 1e3, 1e6, 0)

ctmp <- toupper(df$cropdmgexp)
ptmp <- toupper(df$propdmgexp)
cnew <- rep(NA_integer_, length(ctmp)) # new vectors will be integers
pnew <- rep(NA_integer_, length(ptmp))

for (i in 1:length(old)) {
    cnew[ctmp == old[[i]]] <- new[[i]]
    pnew[ptmp == old[[i]]] <- new[[i]]
}

df$cropdmgexp <- cnew
df$propdmgexp <- pnew
rm(ctmp, ptmp, cnew, pnew, old, new, i)

2.2 Fixing evtype

evtype may be the most important variable in our dataset. It is the variable that categorizes each event as tornado, flood and so forth. Later we’ll use evtype to aggregate human and financial damage for each type of weather event and it’s important that they reflect the categories we think they do.

In the NOAA documentation there are 48 official evtype values, corresponding to the various weather events they wish to record data for.

In the dataset however, 883 unique values exist after standardizing to lowercase and removing extra whitespace. An examination of the values shows this to have probably resulted from manual data entry. For example, here we list types relating to blizzards.

df %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>% 
    filter(str_detect(evtype, "blizzard")) %>% 
    head(n=7) %>% 
    kable(caption = 'Counts of blizzard-related evtypes')
Counts of blizzard-related evtypes
evtype n
blizzard 2,719
high wind/blizzard 6
heavy snow/blizzard 3
blizzard and extreme wind chil 2
blizzard/heavy snow 2
ground blizzard 2
blizzard and heavy snow 1

We can see that "blizzard" labels most of the observations, with a small number of other observations given similar-sounding labels. These observations should probably be a part of "blizzard". With over 800 types in the dataset, We’d like to know if this example represents a systematic problem.

2.2.1 What proportion of evtypes are problematic?

We constructed a list of official evtypes from NOAA documentation and stored it in evtypes.txt. We wish to know what proportion of records have a non-standard evtype value?

types <- tolower(readLines("evtypes.txt")) # a vector of standard evtypes
df %>% filter(! evtype %in% types) %>% nrow() / nrow(df)
## [1] 0.2958472

About 30% of the data is using some other label for evtype. This is a significant number and could certainly skew our analysis. e.g. if one major event type were split into many smaller types because of typos, it might no longer appear to be a major cause of damage.

2.2.2 How many non-standard evtypes do we need to fix?

What we don’t know yet is how these 30% of records are distributed among non-standard evtypes. i.e. Are they evenly distributed amongst the 800 or so incorrect evtypes (a worst-case scenario) or do a small number of strings represent the majority of these problematic observations?

We’ll group non-standard observations by their evtype and count the observations. We’ll also sort these counts and calculate a cumulative proportion of the observations.

df %>% 
    filter(! evtype %in% types) %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>%  
    mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>% 
    slice_max(n, n=4) %>% kable(caption = "Top four errors in evtype")
Top four errors in evtype
evtype n proportion cumulative
tstm wind 219,946 0.8239468 0.8239468
thunderstorm winds 20,850 0.0781069 0.9020536
marine tstm wind 6,175 0.0231324 0.9251860
urban/sml stream fld 3,392 0.0127069 0.9378929

The four most common errors represent almost 94% of all errors.

Let’s pick a nice round number of observations as a cutoff value. We’ll count how many non-standard evtype strings label 100 or more observations in the dataset. As a reminder, there are 902297 observations in total.

fixtypes <- df %>% 
    filter(! evtype %in% types) %>%
    count(evtype) %>% 
    arrange(desc(n)) %>% 
    filter(n >= 100) %>% 
    pull(evtype)
length(fixtypes)
## [1] 32

This is good news, we won’t have to fix over 800 errors. Fixing the 32 most common should ensure a much more accurate analysis. Let’s calculate what proportion of observations will be correct, if we fix evtypes with 100 or more observations.

df %>% 
    filter(! evtype %in% types) %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>% 
    mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>% 
    filter(n > 99, n <=101) %>% kable()
evtype n proportion cumulative
moderate snowfall 101 0.0003784 0.9843337

98.4% of our problematic records will be corrected after fixing the top 32. Let’s visualize the situation by plotting the cumulative proportion of correct evtypes.

x <- df %>% 
    filter(! evtype %in% types) %>% 
    count(evtype) %>% 
    arrange(desc(n)) %>%  
    mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>% 
    head(32) 
tmp <- x$cumulative
names(tmp) <- x$evtype
par(mar = c(9,4,2,2), cex.axis = .8)
barplot(tmp, las=2, ylim=c(0,1),
         main="Cumulative proportion correct, after fixing evtype errors")
abline(h=1, lty=2, col='dodgerblue2', lwd=2.5)

Fixing tstm wind alone will correct over 80% of the records, and then we see rapidly diminishing returns.

Since evtype errors represent about 30% of all records, correcting these errors to about 98.4% accuracy (the top 32 errors) will result in an overall evtype accuracy of about (1 - 0.3 * (1 - .984)) * 100 ~= 99.5% for the entire dataset. Further improvements would be via vastly diminished returns, so we’ll consider this an acceptable improvement over the original data at about 70% correct.

2.2.3 Fixing the selected evtypes

We will work through these 32 types systematically and assign them the correct evtype, using the NOAA documentation and some common sense to place them in the correct category.

The rest of the non-standard labels we will leave unchanged, assuming they have too small an effect on the analysis to matter. Let’s first look at the labels we’ll fix.

kable(matrix(fixtypes, nrow = 8), caption = "The 32 most important evtypes to correct")
The 32 most important evtypes to correct
tstm wind flash flooding storm surge light snow
thunderstorm winds extreme cold freezing rain hurricane
marine tstm wind flood/flash flood urban flood river flood
urban/sml stream fld snow heavy surf/high surf record warmth
high winds landslide extreme windchill unseasonably warm
wild/forest fire fog strong winds flooding
winter weather/mix wind dry microburst astronomical high tide
tstm wind/hail rip currents coastal flooding moderate snowfall

Now we’ll fix these incorrect labels by reassigning the observation to the correct evtype.

# map official evtypes to a vector of some incorrect names for that type
fixlist <- list(
    "thunderstorm wind" = c("tstm wind", "thunderstorm winds", "tstm wind/hail", 
                            "wind", "dry microburst"),
    "marine thunderstorm wind" = c("marine tstm wind"),
    "flood" = c("urban/sml stream fld", "flood/flash flood", "urban flood",
                "river flood", "flooding"),
    "flash flood" = c("flash flooding"),
    "coastal flood" = c("coastal flooding"),
    "high wind" = c("high winds"),
    "strong wind" = c("strong winds"),
    "wildfire" = c("wild/forest fire"),
    "winter weather" = c("winter weather/mix"),
    "extreme cold/wind chill" = c("extreme cold", "extreme windchill"),
    "heavy snow" = c("snow", "light snow", "moderate snowfall"),
    "debris flow" = c("landslide"),
    "dense fog" = c("fog"),
    "rip current" = c("rip currents"),
    "excessive heat" = c("unseasonably warm", "record warmth"),
    # from docs: "..caused by any combination of...high astronomical tide..."
    "storm surge/tide" = c("storm surge", "astronomical high tide"),  
    "high surf" = c("heavy surf/high surf"),
    "winter weather" = c("freezing rain"),
    "hurricane (typhoon)" = c("hurricane")
)

tmp <- df$evtype

# replace all the bad evtypes with official evtypes
for (officialtype in names(fixlist)) {
    for (badtype in fixlist[[officialtype]]) {
        tmp[grep(pattern = badtype, x = tmp)] <- officialtype
    }
}
df$evtype <- tmp
rm(tmp)

2.2.3.1 How well did the corrections work?

Previously, almost 30% of records had an incorrect evtype. Let’s recalculate that proportion.

df %>% filter(! evtype %in% types) %>% nrow() / nrow(df)
## [1] 0.003188529

Now, only 0.3% of records have some unusual evtype from data entry errors, or in other words evtype is now almost 99.7% correct. Earlier we estimated we’d see 99.5% correct, so this is reasonable.

2.2.3.2 A final test: dollar amounts

Let’s make one more test. It’s possible these small number of observations are actually large in terms of human or financial damage. Let us compare the sum of fatalities, injuries, property damage, and crop damage of the remaining uncorrected records to the whole. If our suppositions are correct, the records with non-standard evtypes will be only a small percentage of the total damage.

results <- df %>% 
    mutate(property = propdmg*propdmgexp, crops = cropdmg*cropdmgexp) %>% 
    group_by("standard_evtypes" = evtype %in% types) %>% 
    summarise(across(c(fatalities, injuries, property, crops),
                     ~sum(., na.rm=TRUE))) %>% 
    t()
results <- results[-1, ]
colnames(results) <- c("sum.non.standard", "sum.standard")
as_tibble(results) %>% 
    mutate(non.standard.proportion = sum.non.standard / (sum.non.standard + sum.standard),
           class = rownames(results)) %>% 
    select(class, everything()) %>% 
    kable(caption = "Proportions of total damage represented by non-standard evtypes")
Proportions of total damage represented by non-standard evtypes
class sum.non.standard sum.standard non.standard.proportion
fatalities 500 14,645 0.0330142
injuries 1,718 138,810 0.0122253
property 5,921,397,600 421,397,322,440 0.0138571
crops 1,289,044,280 47,815,150,230 0.0262512

With these non-standard cases making up at most 3.3% of the total damage, the dataset is in much better condition than it was in raw form. We’ll continue the analysis with this imperfect but much-improved dataset, assuming that this small proportion of error won’t significantly skew our results.

3 Results

3.1 Across the United States, which types of events are most harmful to population health?

Here we sum the fatalities and injuries across evtype, select the ten largest sums, and then plot the results.

top_fatalities <- df %>% 
    group_by(evtype) %>% 
    summarise(fatalities = sum(fatalities, na.rm = TRUE)) %>% 
    arrange(desc(fatalities)) %>% 
    slice_head(n = 10)

top_injuries <- df %>% 
    group_by(evtype) %>% 
    summarise(injuries = sum(injuries, na.rm = TRUE)) %>% 
    arrange(desc(injuries)) %>% 
    slice_head(n = 10)
par(mfrow = c(1,2), mar=c(c(8, 4, 4, 2) + 0.1), cex.axis=.9)
barplot(top_fatalities$fatalities, names.arg = top_fatalities$evtype, las=2,
        main="Fatalities: top causes")
mtext(text = "Data source: U.S. NOAA storm database, 1950-2011", outer = T,
      line = -25, cex = .7)
barplot(top_injuries$injuries, names.arg = top_injuries$evtype, las=2,
        main="Injuries: top causes")

In terms of both fatalities and injuries, tornadoes are the most destructive events to human health in this dataset.

kable(list(data.frame(rank = 1:10), top_fatalities, data.frame(),data.frame(), top_injuries),
      caption = "Top causes: fatalities & injuries")
Top causes: fatalities & injuries
rank
1
2
3
4
5
6
7
8
9
10
evtype fatalities
tornado 5,633
excessive heat 1,943
thunderstorm wind 1,454
flash flood 978
heat 937
lightning 816
rip current 577
flood 554
avalanche 224
winter storm 206
evtype injuries
tornado 91,346
thunderstorm wind 11,526
flood 6,902
excessive heat 6,542
lightning 5,230
heat 2,100
ice storm 1,975
flash flood 1,777
wildfire 1,456
hail 1,361

3.1.1 Hurricanes?

There appears to be something missing in the above graphs. Knowing something of destructive US weather events one must ask “Where are the hurricanes?”

There is an important note in the FAQ:

The fatalities, injuries, and damage amounts appearing in tropical cyclone events are attributed only to wind damage experienced in the coastal counties/parishes listed. Other tropical cyclone related events such as tornadoes and flooding are listed within their separate event types.

i.e. the deaths and damage from hurricanes have deliberately been split up into multiple evtypes and will show up as part of floods, lightning, wind, etc. Hurricane-spawned tornadoes are also included in the tornado events graphed above. From the point of view of this dataset, hurricanes may be understood as a ‘meta-event’ that causes other events.

3.1.2 Why is there both heat and excessive heat?

It is also worth noting that some similar events have different official evtypes, e.g. excessive heat and heat, both of which show up in the graphs above. We examined the NOAA documentation that accompanies the dataset and verified that they are indeed separate, official categories. We will leave the situation as is, assuming the NOAA has good reason to organize their data thus.

3.2 Across the United States, which types of events have the greatest economic consequences?

We proceed as earlier, summing property and crop damage across all evtypes.

top_property_dmg <- df %>% 
    group_by(evtype) %>% 
    summarise(property = sum(propdmg * propdmgexp, na.rm = TRUE)) %>% 
    arrange(desc(property)) %>% 
    slice_head(n = 10)
top_crop_dmg <- df %>% 
    group_by(evtype) %>% 
    summarise(crop = sum(cropdmg * cropdmgexp, na.rm = TRUE)) %>% 
    arrange(desc(crop)) %>% 
    slice_head(n = 10)

par(mfrow=c(1,2), mar=c(c(9, 5, 4, 2) + 0.1), cex.axis=.9, mgp = c(4, 1, 0))
barplot(top_property_dmg$property, names.arg = top_property_dmg$evtype, las=2,
        main="Property damage: top causes", ylab = "US Dollars")
mtext(text = "Data source: U.S. NOAA storm database, 1950-2011", outer = T,
      line = -25, cex = .7)
barplot(top_crop_dmg$crop, names.arg = top_crop_dmg$evtype, las=2,
        main="Crop damage: top causes")

kable(list(data.frame(rank = 1:10), top_property_dmg, data.frame(),data.frame(), 
           top_crop_dmg),
      caption = "Top causes: property & crop damage")
Top causes: property & crop damage
rank
1
2
3
4
5
6
7
8
9
10
evtype property
flood 150,751,456,324
hurricane (typhoon) 84,656,180,010
tornado 56,937,162,897
storm surge/tide 47,974,149,000
thunderstorm wind 17,749,372,097
flash flood 16,140,865,011
hail 15,732,269,877
wildfire 7,766,963,500
tropical storm 7,703,890,550
winter storm 6,688,497,260
evtype crop
drought 13,972,566,000
flood 10,853,820,100
hurricane (typhoon) 5,505,292,800
ice storm 5,022,113,500
hail 3,025,954,650
thunderstorm wind 2,159,320,250
flash flood 1,421,317,100
extreme cold/wind chill 1,312,973,000
frost/freeze 1,094,186,000
heavy rain 733,399,800

As mentioned in the previous section, the amounts for hurricane (typhoon) are only for wind damage on coastal communities. Hurricane-caused flooding, etc. are grouped into separate evtypes.


author website: andrewluyt.github.io

To aid reproducibility, here is the session environment at time of publishing.

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
## 
## locale:
##  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4 lubridate_1.8.0  forcats_0.5.1    stringr_1.4.0   
##  [5] dplyr_1.0.7      purrr_0.3.4      readr_2.1.1      tidyr_1.1.4     
##  [9] tibble_3.1.6     ggplot2_3.3.5    tidyverse_1.3.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        svglite_2.0.0     assertthat_0.2.1  digest_0.6.29    
##  [5] utf8_1.2.2        R6_2.5.1          cellranger_1.1.0  backports_1.4.1  
##  [9] reprex_2.0.1      evaluate_0.14     httr_1.4.2        highr_0.9        
## [13] pillar_1.6.4      rlang_0.4.12      readxl_1.3.1      rstudioapi_0.13  
## [17] jquerylib_0.1.4   rmarkdown_2.11    webshot_0.5.2     munsell_0.5.0    
## [21] broom_0.7.11      compiler_4.1.2    modelr_0.1.8      xfun_0.29        
## [25] pkgconfig_2.0.3   systemfonts_1.0.3 htmltools_0.5.2   tidyselect_1.1.1 
## [29] codetools_0.2-18  fansi_0.5.0       viridisLite_0.4.0 crayon_1.4.2     
## [33] tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3       grid_4.1.2       
## [37] jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1   DBI_1.1.2        
## [41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6    
## [45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1   
## [49] vctrs_0.3.8       tools_4.1.2       glue_1.6.0        hms_1.1.1        
## [53] fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2  rvest_1.0.2      
## [57] knitr_1.37        haven_2.4.3