Most recent update: 2022-01-15
This report is a final project for the course Reproducible Research by Johns Hopkins University.
In this analysis we will report on the top weather-related causes of human death & injury and property & crop damage. To calculate these sums we will use the NOAA’s data set of weather events and associated damage for events in the USA between 1950 and November 2011. We spend substantial effort correcting manual data entry inconsistencies in the evtype variable, and note that damage caused by hurricanes is split up into multiple categories, for instance flood and hail.
We find the most damaging weather event to human health is the tornado, and the most damaging weather events to property and crops are flooding and droughts, respectively.
We’ll load libraries first.
# data.table::fread() is also required. We won't load the entire
# package because its namespace overlaps some other functions we use.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra)) # table formatting tools
# Set global table styling options here by overriding kable()
kable <- function(data, ...) {
knitr::kable(data, format.args = list(big.mark = ","), ...) %>%
kable_styling(full_width=FALSE, position="left")
}Next we load the data and perform some basic data cleaning:
county_end and countyendn are all 0 and NA, respectively.evtype to lowercase with no extra spaces.
df <- as_tibble(data.table::fread(file = "repdata_data_StormData.csv.bz2"))
df <- rename(df, state_num = STATE__, longitude_e = LONGITUDE_)
names(df) <- tolower(names(df))
# these variables are all 0 and all NA, respectively
df <- select(df, -c(county_end, countyendn))
# standardize `evtype` to lowercase, removing spaces at beginning and end
# and removing extra internal spaces
df$evtype <- tolower(str_squish(df$evtype))propdmgexp and cropdmgexpIn the raw data, damage amounts are represented in an unusual way. They are expressed in scientific notation, with the base and exponent stored in separate variables.
The exponents, propdmgexp and cropdmgexp, are supposed to represent a power of 10, which is multiplied by propdmg or cropdmg respectively to express a final damage amount. However, they actually contain unusual character values:
sort(union(unique(df$propdmgexp), unique(df$cropdmgexp)))## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "k" "K" "m"
## [20] "M"
After some investigation, it appears varying data entry standards or data entry error has resulted in this state of affairs. A power of 10 will be assigned to replace each of these symbols. [For a more complete discussion, see this document.]
empty string: 0+ or ?: 10-9: 10hH: 100 “hundred”kK: 1000 “kilo”mM: 10^6 “million”bB: 10^9 “billion”-: NA# Perform the substitution described above
old <- c("-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "B", "H", "K", "M", "")
new <- c(NA, 1, 1, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1e9, 1e2, 1e3, 1e6, 0)
ctmp <- toupper(df$cropdmgexp)
ptmp <- toupper(df$propdmgexp)
cnew <- rep(NA_integer_, length(ctmp)) # new vectors will be integers
pnew <- rep(NA_integer_, length(ptmp))
for (i in 1:length(old)) {
cnew[ctmp == old[[i]]] <- new[[i]]
pnew[ptmp == old[[i]]] <- new[[i]]
}
df$cropdmgexp <- cnew
df$propdmgexp <- pnew
rm(ctmp, ptmp, cnew, pnew, old, new, i)evtypeevtype may be the most important variable in our dataset. It is the variable that categorizes each event as tornado, flood and so forth. Later we’ll use evtype to aggregate human and financial damage for each type of weather event and it’s important that they reflect the categories we think they do.
In the NOAA documentation there are 48 official evtype values, corresponding to the various weather events they wish to record data for.
In the dataset however, 883 unique values exist after standardizing to lowercase and removing extra whitespace. An examination of the values shows this to have probably resulted from manual data entry. For example, here we list types relating to blizzards.
df %>%
count(evtype) %>%
arrange(desc(n)) %>%
filter(str_detect(evtype, "blizzard")) %>%
head(n=7) %>%
kable(caption = 'Counts of blizzard-related evtypes')| evtype | n |
|---|---|
| blizzard | 2,719 |
| high wind/blizzard | 6 |
| heavy snow/blizzard | 3 |
| blizzard and extreme wind chil | 2 |
| blizzard/heavy snow | 2 |
| ground blizzard | 2 |
| blizzard and heavy snow | 1 |
We can see that "blizzard" labels most of the observations, with a small number of other observations given similar-sounding labels. These observations should probably be a part of "blizzard". With over 800 types in the dataset, We’d like to know if this example represents a systematic problem.
evtypes are problematic?We constructed a list of official evtypes from NOAA documentation and stored it in evtypes.txt. We wish to know what proportion of records have a non-standard evtype value?
types <- tolower(readLines("evtypes.txt")) # a vector of standard evtypes
df %>% filter(! evtype %in% types) %>% nrow() / nrow(df)## [1] 0.2958472
About 30% of the data is using some other label for evtype. This is a significant number and could certainly skew our analysis. e.g. if one major event type were split into many smaller types because of typos, it might no longer appear to be a major cause of damage.
evtypes do we need to fix?What we don’t know yet is how these 30% of records are distributed among non-standard evtypes. i.e. Are they evenly distributed amongst the 800 or so incorrect evtypes (a worst-case scenario) or do a small number of strings represent the majority of these problematic observations?
We’ll group non-standard observations by their evtype and count the observations. We’ll also sort these counts and calculate a cumulative proportion of the observations.
df %>%
filter(! evtype %in% types) %>%
count(evtype) %>%
arrange(desc(n)) %>%
mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>%
slice_max(n, n=4) %>% kable(caption = "Top four errors in evtype")| evtype | n | proportion | cumulative |
|---|---|---|---|
| tstm wind | 219,946 | 0.8239468 | 0.8239468 |
| thunderstorm winds | 20,850 | 0.0781069 | 0.9020536 |
| marine tstm wind | 6,175 | 0.0231324 | 0.9251860 |
| urban/sml stream fld | 3,392 | 0.0127069 | 0.9378929 |
The four most common errors represent almost 94% of all errors.
Let’s pick a nice round number of observations as a cutoff value. We’ll count how many non-standard evtype strings label 100 or more observations in the dataset. As a reminder, there are 902297 observations in total.
fixtypes <- df %>%
filter(! evtype %in% types) %>%
count(evtype) %>%
arrange(desc(n)) %>%
filter(n >= 100) %>%
pull(evtype)
length(fixtypes)## [1] 32
This is good news, we won’t have to fix over 800 errors. Fixing the 32 most common should ensure a much more accurate analysis. Let’s calculate what proportion of observations will be correct, if we fix evtypes with 100 or more observations.
df %>%
filter(! evtype %in% types) %>%
count(evtype) %>%
arrange(desc(n)) %>%
mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>%
filter(n > 99, n <=101) %>% kable()| evtype | n | proportion | cumulative |
|---|---|---|---|
| moderate snowfall | 101 | 0.0003784 | 0.9843337 |
98.4% of our problematic records will be corrected after fixing the top 32. Let’s visualize the situation by plotting the cumulative proportion of correct evtypes.
x <- df %>%
filter(! evtype %in% types) %>%
count(evtype) %>%
arrange(desc(n)) %>%
mutate(proportion = n / sum(n), cumulative = cumsum(proportion)) %>%
head(32)
tmp <- x$cumulative
names(tmp) <- x$evtype
par(mar = c(9,4,2,2), cex.axis = .8)
barplot(tmp, las=2, ylim=c(0,1),
main="Cumulative proportion correct, after fixing evtype errors")
abline(h=1, lty=2, col='dodgerblue2', lwd=2.5)Fixing tstm wind alone will correct over 80% of the records, and then we see rapidly diminishing returns.
Since evtype errors represent about 30% of all records, correcting these errors to about 98.4% accuracy (the top 32 errors) will result in an overall evtype accuracy of about (1 - 0.3 * (1 - .984)) * 100 ~= 99.5% for the entire dataset. Further improvements would be via vastly diminished returns, so we’ll consider this an acceptable improvement over the original data at about 70% correct.
evtypesWe will work through these 32 types systematically and assign them the correct evtype, using the NOAA documentation and some common sense to place them in the correct category.
The rest of the non-standard labels we will leave unchanged, assuming they have too small an effect on the analysis to matter. Let’s first look at the labels we’ll fix.
kable(matrix(fixtypes, nrow = 8), caption = "The 32 most important evtypes to correct")| tstm wind | flash flooding | storm surge | light snow |
| thunderstorm winds | extreme cold | freezing rain | hurricane |
| marine tstm wind | flood/flash flood | urban flood | river flood |
| urban/sml stream fld | snow | heavy surf/high surf | record warmth |
| high winds | landslide | extreme windchill | unseasonably warm |
| wild/forest fire | fog | strong winds | flooding |
| winter weather/mix | wind | dry microburst | astronomical high tide |
| tstm wind/hail | rip currents | coastal flooding | moderate snowfall |
Now we’ll fix these incorrect labels by reassigning the observation to the correct evtype.
# map official evtypes to a vector of some incorrect names for that type
fixlist <- list(
"thunderstorm wind" = c("tstm wind", "thunderstorm winds", "tstm wind/hail",
"wind", "dry microburst"),
"marine thunderstorm wind" = c("marine tstm wind"),
"flood" = c("urban/sml stream fld", "flood/flash flood", "urban flood",
"river flood", "flooding"),
"flash flood" = c("flash flooding"),
"coastal flood" = c("coastal flooding"),
"high wind" = c("high winds"),
"strong wind" = c("strong winds"),
"wildfire" = c("wild/forest fire"),
"winter weather" = c("winter weather/mix"),
"extreme cold/wind chill" = c("extreme cold", "extreme windchill"),
"heavy snow" = c("snow", "light snow", "moderate snowfall"),
"debris flow" = c("landslide"),
"dense fog" = c("fog"),
"rip current" = c("rip currents"),
"excessive heat" = c("unseasonably warm", "record warmth"),
# from docs: "..caused by any combination of...high astronomical tide..."
"storm surge/tide" = c("storm surge", "astronomical high tide"),
"high surf" = c("heavy surf/high surf"),
"winter weather" = c("freezing rain"),
"hurricane (typhoon)" = c("hurricane")
)
tmp <- df$evtype
# replace all the bad evtypes with official evtypes
for (officialtype in names(fixlist)) {
for (badtype in fixlist[[officialtype]]) {
tmp[grep(pattern = badtype, x = tmp)] <- officialtype
}
}
df$evtype <- tmp
rm(tmp)Previously, almost 30% of records had an incorrect evtype. Let’s recalculate that proportion.
df %>% filter(! evtype %in% types) %>% nrow() / nrow(df)## [1] 0.003188529
Now, only 0.3% of records have some unusual evtype from data entry errors, or in other words evtype is now almost 99.7% correct. Earlier we estimated we’d see 99.5% correct, so this is reasonable.
Let’s make one more test. It’s possible these small number of observations are actually large in terms of human or financial damage. Let us compare the sum of fatalities, injuries, property damage, and crop damage of the remaining uncorrected records to the whole. If our suppositions are correct, the records with non-standard evtypes will be only a small percentage of the total damage.
results <- df %>%
mutate(property = propdmg*propdmgexp, crops = cropdmg*cropdmgexp) %>%
group_by("standard_evtypes" = evtype %in% types) %>%
summarise(across(c(fatalities, injuries, property, crops),
~sum(., na.rm=TRUE))) %>%
t()
results <- results[-1, ]
colnames(results) <- c("sum.non.standard", "sum.standard")
as_tibble(results) %>%
mutate(non.standard.proportion = sum.non.standard / (sum.non.standard + sum.standard),
class = rownames(results)) %>%
select(class, everything()) %>%
kable(caption = "Proportions of total damage represented by non-standard evtypes")| class | sum.non.standard | sum.standard | non.standard.proportion |
|---|---|---|---|
| fatalities | 500 | 14,645 | 0.0330142 |
| injuries | 1,718 | 138,810 | 0.0122253 |
| property | 5,921,397,600 | 421,397,322,440 | 0.0138571 |
| crops | 1,289,044,280 | 47,815,150,230 | 0.0262512 |
With these non-standard cases making up at most 3.3% of the total damage, the dataset is in much better condition than it was in raw form. We’ll continue the analysis with this imperfect but much-improved dataset, assuming that this small proportion of error won’t significantly skew our results.
Here we sum the fatalities and injuries across evtype, select the ten largest sums, and then plot the results.
top_fatalities <- df %>%
group_by(evtype) %>%
summarise(fatalities = sum(fatalities, na.rm = TRUE)) %>%
arrange(desc(fatalities)) %>%
slice_head(n = 10)
top_injuries <- df %>%
group_by(evtype) %>%
summarise(injuries = sum(injuries, na.rm = TRUE)) %>%
arrange(desc(injuries)) %>%
slice_head(n = 10)par(mfrow = c(1,2), mar=c(c(8, 4, 4, 2) + 0.1), cex.axis=.9)
barplot(top_fatalities$fatalities, names.arg = top_fatalities$evtype, las=2,
main="Fatalities: top causes")
mtext(text = "Data source: U.S. NOAA storm database, 1950-2011", outer = T,
line = -25, cex = .7)
barplot(top_injuries$injuries, names.arg = top_injuries$evtype, las=2,
main="Injuries: top causes")In terms of both fatalities and injuries, tornadoes are the most destructive events to human health in this dataset.
kable(list(data.frame(rank = 1:10), top_fatalities, data.frame(),data.frame(), top_injuries),
caption = "Top causes: fatalities & injuries")
|
|
|
|
|
There appears to be something missing in the above graphs. Knowing something of destructive US weather events one must ask “Where are the hurricanes?”
There is an important note in the FAQ:
The fatalities, injuries, and damage amounts appearing in tropical cyclone events are attributed only to wind damage experienced in the coastal counties/parishes listed. Other tropical cyclone related events such as tornadoes and flooding are listed within their separate event types.
i.e. the deaths and damage from hurricanes have deliberately been split up into multiple evtypes and will show up as part of floods, lightning, wind, etc. Hurricane-spawned tornadoes are also included in the tornado events graphed above. From the point of view of this dataset, hurricanes may be understood as a ‘meta-event’ that causes other events.
heat and excessive heat?It is also worth noting that some similar events have different official evtypes, e.g. excessive heat and heat, both of which show up in the graphs above. We examined the NOAA documentation that accompanies the dataset and verified that they are indeed separate, official categories. We will leave the situation as is, assuming the NOAA has good reason to organize their data thus.
We proceed as earlier, summing property and crop damage across all evtypes.
top_property_dmg <- df %>%
group_by(evtype) %>%
summarise(property = sum(propdmg * propdmgexp, na.rm = TRUE)) %>%
arrange(desc(property)) %>%
slice_head(n = 10)
top_crop_dmg <- df %>%
group_by(evtype) %>%
summarise(crop = sum(cropdmg * cropdmgexp, na.rm = TRUE)) %>%
arrange(desc(crop)) %>%
slice_head(n = 10)
par(mfrow=c(1,2), mar=c(c(9, 5, 4, 2) + 0.1), cex.axis=.9, mgp = c(4, 1, 0))
barplot(top_property_dmg$property, names.arg = top_property_dmg$evtype, las=2,
main="Property damage: top causes", ylab = "US Dollars")
mtext(text = "Data source: U.S. NOAA storm database, 1950-2011", outer = T,
line = -25, cex = .7)
barplot(top_crop_dmg$crop, names.arg = top_crop_dmg$evtype, las=2,
main="Crop damage: top causes")kable(list(data.frame(rank = 1:10), top_property_dmg, data.frame(),data.frame(),
top_crop_dmg),
caption = "Top causes: property & crop damage")
|
|
|
|
|
As mentioned in the previous section, the amounts for hurricane (typhoon) are only for wind damage on coastal communities. Hurricane-caused flooding, etc. are grouped into separate evtypes.
To aid reproducibility, here is the session environment at time of publishing.
sessionInfo()## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
## [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
## [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.3.4 lubridate_1.8.0 forcats_0.5.1 stringr_1.4.0
## [5] dplyr_1.0.7 purrr_0.3.4 readr_2.1.1 tidyr_1.1.4
## [9] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 svglite_2.0.0 assertthat_0.2.1 digest_0.6.29
## [5] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
## [9] reprex_2.0.1 evaluate_0.14 httr_1.4.2 highr_0.9
## [13] pillar_1.6.4 rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13
## [17] jquerylib_0.1.4 rmarkdown_2.11 webshot_0.5.2 munsell_0.5.0
## [21] broom_0.7.11 compiler_4.1.2 modelr_0.1.8 xfun_0.29
## [25] pkgconfig_2.0.3 systemfonts_1.0.3 htmltools_0.5.2 tidyselect_1.1.1
## [29] codetools_0.2-18 fansi_0.5.0 viridisLite_0.4.0 crayon_1.4.2
## [33] tzdb_0.2.0 dbplyr_2.1.1 withr_2.4.3 grid_4.1.2
## [37] jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.2
## [41] magrittr_2.0.1 scales_1.1.1 cli_3.1.0 stringi_1.7.6
## [45] fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.1
## [49] vctrs_0.3.8 tools_4.1.2 glue_1.6.0 hms_1.1.1
## [53] fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2 rvest_1.0.2
## [57] knitr_1.37 haven_2.4.3