This report aims to determine the severe weather events that cause the most problems in terms of public health and economic impact. The conclusion is that tornadoes are the most injurous events with heat being the main cause of death. The most damaging events in terms of cost are caused by floods impacting property and drought events impact on agriculture. To investigate these events a data set from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was obtained. This contains data for major storms and weather events in the United States during the years from 1950 to 2011 and contains estimates on of any fatalities, injuries, and property damage. The storm data was processed and cleaned (detailed below) and then aggregated by weather event type to discover the most harmful and costly events.
The raw data set was originally sourced from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. For this report, the data was downloaded from this link.
Other documentation that was used to create this report were:
The following libraries were used in creating the report.
library(readr) # for reading in the BZipped CSV data
library(lubridate) # for handling dates
library(dplyr) # for processing data (aggregating, filtering etc)
library(ggplot2) # for plotting
library(stringr) # for string replacement when cleaing the EVTYPE data
library(tidyr) # for tidying and shaping data (gather) for plotting
library(knitr) # for using kable to generate tables
The zipped (BZ2) CSV file was loaded using readr::read_csv into a data set (tibble) called csv_storm_data. The file is a comma delimited file with missing values given as ‘NA’. The BGN_DATE column was used to determine when the weather event happened. The format of this date is explicity specified when reading in the data set since it is in a non-default format.
storm_data_col_types = cols(
BGN_DATE = col_date(format = "%m/%d/%Y %H:%M:%S")
)
csv_storm_data <- read_csv("repdata-data-StormData.csv.bz2", col_types = storm_data_col_types)
This creates a large object in memory.
format(object.size(csv_storm_data) , units = "Mb")
## [1] "469 Mb"
Since we do not require all the columns, only the ones directly required for analysis are selected. These are:
BGN_DATE - the date the event happenedEVTYPE - the weather event typeFATALITIES - an estimate of the number of fatalitiesINJURIES - an estimate of the number of injuriesPROPDMG - an estimate of the property damage in US DollarsPROPDMGEXP - an exponent for multiplying PROPDMGCROPDMG - an estimate of the crop damage in US DollarsCROPDMGEXP - an exponent for multiplying CROPDMGstorm_data <- csv_storm_data %>%
select(BGN_DATE, EVTYPE,
FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
This greatly reduced the size of the data set.
format(object.size(storm_data) , units = "Mb")
## [1] "55.1 Mb"
The report looked at the number of different event types (EVTYPE) by year.
storm_year_event <- storm_data %>%
mutate(event_year = year(BGN_DATE)) %>%
group_by(event_year, EVTYPE) %>%
summarise(event_count = n())
# Show a count of the different EVTYPEs by year
table(storm_year_event$event_year)
##
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## 3 3 3 3 3 3 3 3 3 3 3 3 3 160 267
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 387 227 169 124 119 110 120 99 51 38 46 50 46 46 46
## 2010 2011
## 46 46
From the table above, many more types of event were recorded from 1994 onwards.
storm_year_event %>%
group_by(event_year) %>%
summarise(n_events = sum(event_count)) %>%
arrange(desc(event_year)) %>%
head(20)
## # A tibble: 20 × 2
## event_year n_events
## <dbl> <int>
## 1 2011 62174
## 2 2010 48161
## 3 2009 45817
## 4 2008 55663
## 5 2007 43289
## 6 2006 44034
## 7 2005 39184
## 8 2004 39363
## 9 2003 39752
## 10 2002 36293
## 11 2001 34962
## 12 2000 34471
## 13 1999 31289
## 14 1998 38128
## 15 1997 28680
## 16 1996 32270
## 17 1995 27970
## 18 1994 20631
## 19 1993 12607
## 20 1992 13534
There is also a jump in the number of events between 1993 and 1994 and future years. Therefore, the report looked only at data from 1994 onwards to 2011.
# extract data from 1994 onwards
storm_data <- storm_data %>% filter(BGN_DATE >= "1994-01-01")
Check the first few rows:
kable(head(storm_data), caption = "Filtered storm data from 1994 onwards (first few rows)")
| BGN_DATE | EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP |
|---|---|---|---|---|---|---|---|
| 1995-01-06 | FREEZING RAIN | 0 | 0 | 0.0 | NA | 0 | NA |
| 1995-01-22 | SNOW | 0 | 0 | 0.0 | NA | 0 | NA |
| 1994-02-09 | ICE STORM/FLASH FLOOD | 0 | 2 | 0.0 | NA | 0 | NA |
| 1995-02-06 | SNOW/ICE | 0 | 0 | 0.0 | NA | 0 | NA |
| 1995-02-11 | SNOW/ICE | 0 | 0 | 0.0 | NA | 0 | NA |
| 1995-10-04 | HURRICANE OPAL/HIGH WINDS | 2 | 0 | 0.1 | B | 10 | M |
This reduces the size of the storm_data set to 702,131 rows.
The EVTYPE data has a mixture of upper and lower case. To make this consistent, the event types were converted to all upper case.
storm_data <- storm_data %>% mutate(EVTYPE = toupper(EVTYPE))
A list of standard EVTYPE is provided in the “National Weather Service Instruction document”, on “Table 2.1.1 Storm Data Event Table”.
By looking at the EVTYPE in the data set, it can be seen that there are quite a few values that do not correspond to the documented list. For example, “THUNDERSTORMW WINDS”, “TSTM”, and “THUNDERTORM WINDS” instead of “THUNDERSTORM WIND”.
Also, there are some EVTYPEs that use a different terminology to the standard list. The main ones that were found were:
To clean this data, a set of transformations were created to map non-standard EVTYPE values from the data set to the standard ones. This consisted of a regular expression to match the non-standard value and a corresponding string with the standard name. This was not an exact process as the regular expressions could be greedy and match too much of the EVTYPE value. I also looked at the function str_replace_all() - this was much faster, but it would match multiple expressions. This set of transformations was then used by a function to replace non-standard EVTYPEs with standard ones.
Tidy event types to match those in documentation
# A list of expressions to match and their replacement
event_translations <- read_csv(
"pattern,event_type
(.*FIRE.*),WILDFIRE
(.*HIGH WIND.*),HIGH WIND
(.*HEAVY RAIN.*),HEAVY RAIN
(.*(COLD|CHILL).*),COLD/WIND CHILL
(.*(FREEZ|FROST).*),FROST/FREEZE
(.*FLASH FLOOD.*),FLASH FLOOD
(.*FLOOD.*),FLOOD
(.*STREAM FLD.*),FLOOD
(.*HAIL.*),HAIL
(^TORNADO.*),TORNADO
(LANDSLIDE),DEBRIS FLOW
(HURRICANE.*),HURRICANE (TYPHOON)
(.*TYPHOON.*),HURRICANE (TYPHOON)
(S.*THUNDER.*),THUNDERSTORM WIND
(THUNDERSTORM.*WIN.*),THUNDERSTORM WIND
(^TSTM.*),THUNDERSTORM WIND
(^TROPICAL STORM.*),TROPICAL STORM
(.*WIND.*),STRONG WIND
(.*EX.*HEAT.*),EXCESSIVE HEAT
(.*HEAT.*),HEAT
(.*RAIN.*),HEAVY RAIN
(.*WET.*),HEAVY RAIN
(.*SURGE.*),STORM SURGE/TIDE
(.*THUNDE.*),THUNDERSTORM WIND
(.*SNOW.*),HEAVY SNOW
(.*FOG.*),DENSE FOG
(.*SURF.*),HIGH SURF
(.*HIGH TIDE.*),STORM SURGE/TIDE
(.*WINTER WEATHER.*),WINTER WEATHER
(.*MICROBURST.*),THUNDERSTORM WIND
(.*WATERSPOUT.*),WATERSPOUT
(.*GUSTNADO.*),THUNDERSTORM WIND
(.*ICE.*),WINTER STORM")
translate_event_type <- function(event_string) {
# save the maximum number of regular expressions we have to process for each EVTYPE
event_translations_length <- nrow(event_translations)
unlist(
lapply(event_string, function(x) {
for (ix in 1:event_translations_length) {
# Search for a matching string and replace if found
event_tidy <- str_replace(x,
event_translations[ix,1]$pattern,
event_translations[ix,2]$event_type)
# If we have replaced the text, then exit out of the for-loop
if ( event_tidy != x ) {
break
}
}
# Return the EVTYPE or the cleaned EVTYPE
event_tidy
})
)
}
Note that there are 702,131 rows of data in storm_data. Running the translate_event_type() function on this data set took a long time (more than 10 minutes). So, the actual cleaning of EVTYPE was done on aggregated data as below.
Events where no fatalities or injuries occurred were filtered out - again reducing the number of rows
storm_harm <- storm_data %>% filter(FATALITIES > 0 | INJURIES > 0)
To reduce the size of the data sets, separate data sets were created for Fatalities and Injuries. These were then cleaned as described above, the EVTYPE was standardised. The data set was aggregated based on the EVTYPE.
# Aggregate by EVTYPE to remove date data
fatal_sum <- storm_harm %>% filter(FATALITIES > 0) %>%
group_by(EVTYPE) %>%
summarise(fatalities_count = sum(FATALITIES))
# Aggregate by cleaned EVTYPE
fatal_sum_clean <- fatal_sum %>% mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
group_by(EVTYPE) %>%
summarise(fatalities_total = sum(fatalities_count)) %>%
arrange(desc(fatalities_total))
The same steps were performed for injuries:
injury_sum <- storm_harm %>% filter(INJURIES > 0) %>%
group_by(EVTYPE) %>%
summarise(injuries_count = sum(INJURIES))
injury_sum_clean <- injury_sum %>%
mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
group_by(EVTYPE) %>%
summarise(injuries_total = sum(injuries_count)) %>%
arrange(desc(injuries_total))
Similar to fatalities and injuries, the EVTYPE values were cleaned and the data sets were aggregated by that variable.
Events where no damage occurred were filtered out.
storm_damage <- storm_data %>% filter(CROPDMG > 0 | PROPDMG > 0)
To calculate damage, the cost (CROPDMG and PROPDMG) had to be multiplied by a factor that was determined by a corresponding exponent column (CROPDMGEXP and PROPDMGEXP).
There were some issues related to this calculation and “How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP” web page was very helpful.
For example, there are some storms where property damage is non-zero, but the exponent is NA
storm_damage %>% filter(PROPDMG != 0) %>% filter(is.na(PROPDMGEXP))
## # A tibble: 70 × 8
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## <date> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1995-03-09 FLASH FLOOD WINDS 0 0 0.41 <NA>
## 2 1995-08-19 THUNDERSTORM WIND 0 0 2.00 <NA>
## 3 1995-05-14 FLASH FLOOD 0 0 10.00 <NA>
## 4 1995-05-18 FLASH FLOOD 0 0 10.00 <NA>
## 5 1995-05-17 FLASH FLOOD 0 0 10.00 <NA>
## 6 1995-05-28 THUNDERSTORM WINDS 0 0 4.00 <NA>
## 7 1995-05-12 THUNDERSTORM WINDS 0 0 5.00 <NA>
## 8 1995-04-08 HAIL 0 0 10.00 <NA>
## 9 1995-04-08 HAIL 0 0 35.00 <NA>
## 10 1995-05-23 FLASH FLOOD 0 0 75.00 <NA>
## # ... with 60 more rows, and 2 more variables: CROPDMG <dbl>,
## # CROPDMGEXP <chr>
There are similar cases for crop damage data.
storm_damage %>% filter(CROPDMG != 0) %>% filter(is.na(CROPDMGEXP)) %>% select(BGN_DATE, EVTYPE, CROPDMG, CROPDMGEXP)
## # A tibble: 3 × 4
## BGN_DATE EVTYPE CROPDMG CROPDMGEXP
## <date> <chr> <dbl> <chr>
## 1 1994-07-04 HAIL 3 <NA>
## 2 1994-04-05 THUNDERSTORM WINDS 4 <NA>
## 3 1994-04-15 THUNDERSTORM WINDS 4 <NA>
To calculate the crop and property cost, a function to translate the exponent to a number was created:
# See https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html
# Overall conclusion for all numeric, exp 0, 1, 2, 3, 4, 5, 6, 7, 8, they are multiplier of 10.
# e.g. 50 exp 0 is 500, 88 exp 5 is 880
# empty-character () is == multiplier of 0.
# Translate exponent character to a number
exp_translations <- c("+" = 1,
"H" = 100,
"K" = 1000,
"M" = 1000000,
"B" = 1000000000)
convert_exp <- function(exp_str) {
# check if the EXP is a number
exp_number <- as.integer(exp_str)
# check if the EXP can be mapped to a number
exp_factor <- ifelse(is.na(exp_number), exp_translations[toupper(exp_str)], 10)
# otherwise set it to 0
exp_factor <- ifelse(is.na(exp_factor), 0, exp_factor)
exp_factor
}
The crop and damage costs were then calculated using the above function as follows:
storm_damage <- storm_damage %>%
mutate(crop_damage = convert_exp(CROPDMGEXP) * CROPDMG,
prop_damage = convert_exp(PROPDMGEXP) * PROPDMG)
## Warning in convert_exp(c("M", NA, "K", NA, NA, "M", NA, "M", "m", NA, NA, :
## NAs introduced by coercion
## Warning in convert_exp(c("B", "K", "M", "K", "K", "M", "K", "M", "m",
## "K", : NAs introduced by coercion
A data set containing events that caused crop damage was created. This was aggregated by the event type (EVTYPE).
crop_sum <- storm_damage %>% filter(CROPDMG > 0) %>%
group_by(EVTYPE) %>%
summarise(crop_damage_total = sum(crop_damage))
This is now a much smaller data set (123 rows) and the translate_event_type() function can be executed and run with in a short time (a few seconds).
crop_sum_clean <- crop_sum %>%
mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
group_by(EVTYPE) %>%
summarise(crop_damage_cost = sum(crop_damage_total)) %>%
arrange(desc(crop_damage_cost))
The same steps were performed for property damage:
prop_sum <- storm_damage %>% filter(PROPDMG > 0) %>%
group_by(EVTYPE) %>%
summarise(prop_damage_total = sum(prop_damage))
prop_sum_clean <- prop_sum %>%
mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
group_by(EVTYPE) %>%
summarise(prop_damage_cost = sum(prop_damage_total)) %>%
arrange(desc(prop_damage_cost))
From the fatal_sum_clean and injury_sum_clean data sets, it can be seen that injuries are more common than fatalities.
Fatalities
kable(head(fatal_sum_clean),
caption = "Fatalities by Weather Event (first few rows)",
format.args = list(big.mark = ','))
| EVTYPE | fatalities_total |
|---|---|
| HEAT | 3,016 |
| TORNADO | 1,593 |
| FLOOD | 1,443 |
| LIGHTNING | 794 |
| STRONG WIND | 645 |
| RIP CURRENT | 368 |
Injuries
kable(head(injury_sum_clean),
caption = "Injuries by Weather Event (first few rows)",
format.args = list(big.mark = ','))
| EVTYPE | injuries_total |
|---|---|
| TORNADO | 22,589 |
| HEAT | 9,064 |
| FLOOD | 8,623 |
| LIGHTNING | 5,116 |
| THUNDERSTORM WIND | 4,445 |
| WINTER STORM | 3,375 |
The two data sets were cut off data points at an appropriate quantile. In this way, only the top most harmful weather events were plotted and tabulated below. This was found by trying different quantiles and plotting the results.
quantile(fatal_sum_clean$fatalities_total)
## 0% 25% 50% 75% 100%
## 1.0 2.0 12.0 112.5 3016.0
quantile(injury_sum_clean$injuries_total)
## 0% 25% 50% 75% 100%
## 1.0 5.0 129.0 488.5 22589.0
Plot a facet grid between injuries and fatalities
fatal_cut_off <- quantile(fatal_sum_clean$fatalities_total)[4]
injury_cut_off <- quantile(injury_sum_clean$injuries_total)[4]
# Join the two data sets into one
storm_harm_total <- full_join(fatal_sum_clean, injury_sum_clean, by = c("EVTYPE") )
# Combine fatalities and injuries into one column
storm_harm_type <- storm_harm_total %>%
select(event = EVTYPE,
Fatalities = fatalities_total,
Injuries = injuries_total) %>%
gather(harm_type, harm_total, -event) %>%
filter( (harm_type == 'Fatalities' & harm_total >= fatal_cut_off) |
(harm_type == 'Injuries' & harm_total >= injury_cut_off) )
# Plot results - note that reoder() does not work as expected - perhaps due to facet_grid()
storm_harm_type %>%
ggplot(aes(reorder(event, harm_total), harm_total, fill = event) ) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_grid(harm_type ~ .) +
labs(x = "Weather Event Type",
y = "Estimated accounts of injuries and fatalities") +
ggtitle("Most harmful weather events between 1994 and 2011")
Tabulated results show the most harmful events.
harm <- as.data.frame(storm_harm_type %>% arrange(desc(harm_total)))
kable(head(harm,10), caption = "Most harmful weather events ordered by most number of cases first", format.args = list(big.mark = ','))
| event | harm_type | harm_total |
|---|---|---|
| TORNADO | Injuries | 22,589 |
| HEAT | Injuries | 9,064 |
| FLOOD | Injuries | 8,623 |
| LIGHTNING | Injuries | 5,116 |
| THUNDERSTORM WIND | Injuries | 4,445 |
| WINTER STORM | Injuries | 3,375 |
| STRONG WIND | Injuries | 3,050 |
| HEAT | Fatalities | 3,016 |
| TORNADO | Fatalities | 1,593 |
| WILDFIRE | Injuries | 1,458 |
So, tornadoes cause the most injuries, while heat events cause the more fatalities.
From the prop_sum_clean and prop_sum_clean data sets, it can be seen that damage to property is larger than crop damage.
Crop Damage
kable(head(crop_sum_clean), caption = "Crop damage per weather event in US Dollars (first few rows, ordered by most expensive first)", format.args = list(big.mark = ','))
| EVTYPE | crop_damage_cost |
|---|---|
| DROUGHT | 13,922,066,000 |
| FLOOD | 7,097,621,550 |
| HURRICANE (TYPHOON) | 5,505,617,800 |
| WINTER STORM | 5,034,058,300 |
| HAIL | 3,068,457,700 |
| FROST/FREEZE | 1,886,061,000 |
Property Damage
kable(head(prop_sum_clean), caption = "Property damage per weather event in US Dollars (first few rows, ordered by most expensive first)", format.args = list(big.mark = ','))
| EVTYPE | prop_damage_cost |
|---|---|
| FLOOD | 160,707,771,901 |
| HURRICANE (TYPHOON) | 85,200,910,010 |
| STORM SURGE/TIDE | 47,844,150,500 |
| TORNADO | 25,625,141,797 |
| HAIL | 15,622,770,827 |
| STRONG WIND | 8,945,965,810 |
For generating a plot of crop and property damage, the data sets were filtered based on a appropriate value from the above quantiles. This was found by trying different quantiles and plotting the results.
quantile(crop_sum_clean$crop_damage_cost)
## 0% 25% 50% 75% 100%
## 10000 9523545 361834110 1273116700 13922066000
quantile(prop_sum_clean$prop_damage_cost)
## 0% 25% 50% 75% 100%
## 50 30000 712630 144062000 160707771901
crop_cut_off_cost <- quantile(crop_sum_clean$crop_damage_cost)[3]
prop_cut_off_cost <- quantile(prop_sum_clean$prop_damage_cost)[4]
A facet grid between crop and property damage was plotted.
storm_damage_total <- full_join(crop_sum_clean, prop_sum_clean, by = c("EVTYPE") )
storm_damage_type <- storm_damage_total %>%
select(event = EVTYPE,
Crop = crop_damage_cost,
Property = prop_damage_cost) %>%
gather(damage_type, damage_cost, -event) %>%
filter( (damage_type == 'Crop' & damage_cost >= crop_cut_off_cost) |
(damage_type == 'Property' & damage_cost >= prop_cut_off_cost) )
storm_damage_type %>%
ggplot(aes(reorder(event, damage_cost), damage_cost/1000000, fill = event) ) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_grid(damage_type ~ .) +
labs(x = "Weather Event Type",
y = "Estimated cost of crop and property damage in US Dollars (Millions)") +
ggtitle("Highest economic impact weather events between 1994 and 2011")
damage <- as.data.frame(storm_damage_type %>% arrange(desc(damage_cost)))
kable(head(damage,10),
caption = "Top most damage causing weather events by type of damage in US Dollars",
format.args = list(big.mark = ','))
| event | damage_type | damage_cost |
|---|---|---|
| FLOOD | Property | 160,707,771,901 |
| HURRICANE (TYPHOON) | Property | 85,200,910,010 |
| STORM SURGE/TIDE | Property | 47,844,150,500 |
| TORNADO | Property | 25,625,141,797 |
| HAIL | Property | 15,622,770,827 |
| DROUGHT | Crop | 13,922,066,000 |
| STRONG WIND | Property | 8,945,965,810 |
| WILDFIRE | Property | 7,882,628,500 |
| TROPICAL STORM | Property | 7,713,885,550 |
| FLOOD | Crop | 7,097,621,550 |
So, floods have the greatest economic impact on property, while droughts have the greatest economic impact on crops.
Created using RStudio with the following version of R:
R.version
## _
## platform x86_64-apple-darwin16.5.0
## arch x86_64
## os darwin16.5.0
## system x86_64, darwin16.5.0
## status
## major 3
## minor 4.0
## year 2017
## month 04
## day 21
## svn rev 72570
## language R
## version.string R version 3.4.0 (2017-04-21)
## nickname You Stupid Darkness