The impact of natural disasters on the United States

Abstract

In this analysis the following two questions will be answered:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

The data is preprocessed to acquire a tidy dataset. Then the resulting dataset is used to compute summarized quantities as e.g. the total number of casualties caused by a certain event type. The quantities are visualized to answer the two questions from above

Data processing

The following packages are used for the dataprocessing. A special package is the the urbnmapr package that is meant to simplify the creation of state and county choropleth maps. It can be installed using devtools devtools::install_github(“UrbanInstitute/urbnmapr”).

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(ggplot2)
library(tidyr)
#this library below is from github and is used for visualizing states data 
library(urbnmapr)

Preprocessing

The data is downloaded if no file is found in the current working directory.

link <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename = "storm_data.csv.bz2"
if(!file.exists(filename)){
  download.file(link, filename)
}

The data is is loaded into R with the read.csv function and stored in a tidyr table.

#read the data
df_data <- read.csv(filename, header = TRUE, sep = ",") %>%
  tbl_df

The data sample in heterogeneous regarding the measurements of many variables such as the date and time variable as well as the damage variables.

Date and time processing

The time and date column are converted into a single date time column:

Convert the columns BGN_DATE, BGN_TIME into a new datetime column BGN_DATETIME.
Convert the columns END_DATE, END_TIME into a new datetime column END_DATETIME.

Split the dataset into two parts, coerce date and time columns into characters.

df_data <- df_data %>% mutate(BGN_DATE = as.character(BGN_DATE),
                              BGN_TIME = as.character(BGN_TIME),
                              END_DATE = as.character(END_DATE),
                              END_TIME = as.character(END_TIME)) 

#Split the dataset
is_new_format <- grepl("(AM)|(PM)", df_data$BGN_TIME)
df_data_old_format <- subset(df_data, !is_new_format)
df_data_new_format <- subset(df_data, is_new_format)

Time transformation for the old formatted part is applied. Unfortunately make_datetime does not support the use of ifelse. This is why additional variables have to be specified. The date is parsed using the mdy_hms method from the lubridate package.

#Apply the format conversions to the subsets

df_data_old_format_raw <- df_data_old_format %>% 
                                          mutate(BGN_DATETIME = mdy_hms(BGN_DATE),
                                                 BGN_TIME_TEMP = parse_date_time(BGN_TIME, "%H%M"),
                                                 JOIN_ID = c(1:nrow(df_data_old_format))) %>%
                                          mutate(GOOD_START = !is.na(BGN_TIME_TEMP))

## Warning: 5 failed to parse.

#Set the begin time
df_data_old_format_BGN <- df_data_old_format_raw %>% 
                                          filter(GOOD_START) %>%
                                          mutate(BGN_DATETIME2 = make_datetime(
                                                                  year = year(BGN_DATETIME),
                                                                  month = month(BGN_DATETIME),
                                                                  day = day(BGN_DATETIME),
                                                                  hour = hour(BGN_TIME_TEMP),
                                                                  min = minute(BGN_TIME_TEMP),
                                                                  sec = second(BGN_TIME_TEMP))) %>%
                                          select(BGN_DATETIME2, JOIN_ID)

#select the columns that matter
df_data_old_format <- left_join(df_data_old_format_raw, df_data_old_format_BGN, by = "JOIN_ID") %>%
                      mutate(BGN_DATETIME = BGN_DATETIME2) %>%
                      select(-BGN_DATETIME2, -GOOD_START, -BGN_TIME_TEMP, -JOIN_ID)

Not all entries are in a valid format (5 failed!). The elements that failed to parse are replaced with NA as no meaningful transformation can be applied.

A similar transformation is applied to the newer part of the dataset. Again the lubridate parsing functions are used:

#Apply the format conversions to the subsets

df_data_new_format_raw <- df_data_new_format %>% 
                                          mutate(BGN_DATETIME = mdy_hms(BGN_DATE),
                                                 END_DATETIME = mdy_hms(END_DATE),
                                                 BGN_TIME_TEMP = parse_date_time(BGN_TIME, "%H:%M:%S %p"),
                                                 END_TIME_TEMP = parse_date_time(END_TIME, "%H:%M:%S %p"),
                                                 JOIN_ID = c(1:nrow(df_data_new_format))) %>%
                                          mutate(GOOD_START = !is.na(BGN_TIME_TEMP),
                                                 GOOD_END = !is.na(END_TIME_TEMP))
#Set the begin time
df_data_new_format_BGN <- df_data_new_format_raw %>% 
                                          filter(GOOD_START) %>%
                                          mutate(BGN_DATETIME2 = make_datetime(
                                                                  year = year(BGN_DATETIME),
                                                                  month = month(BGN_DATETIME),
                                                                  day = day(BGN_DATETIME),
                                                                  hour = hour(BGN_TIME_TEMP),
                                                                  min = minute(BGN_TIME_TEMP),
                                                                  sec = second(BGN_TIME_TEMP))) %>%
                                          select(BGN_DATETIME2, JOIN_ID)
#Set the end time
df_data_new_format_END <- df_data_new_format_raw %>% 
                                          filter(GOOD_END) %>%
                                          mutate(END_DATETIME2 = make_datetime(
                                                                  year = year(END_DATETIME),
                                                                  month = month(END_DATETIME),
                                                                  day = day(END_DATETIME),
                                                                  hour = hour(END_TIME_TEMP),
                                                                  min = minute(END_TIME_TEMP),
                                                                  sec = second(END_TIME_TEMP))) %>%
                                          select(END_DATETIME2, JOIN_ID)

df_data_new_format <- left_join(df_data_new_format_raw, df_data_new_format_BGN, by = "JOIN_ID") %>%
                      left_join(df_data_new_format_END, by = "JOIN_ID") %>%
                      mutate(END_DATETIME = END_DATETIME2,
                             BGN_DATETIME = BGN_DATETIME2) %>%
                      select(-BGN_DATETIME2,
                             -END_DATETIME2,
                             -GOOD_START,
                             -GOOD_END,
                             -BGN_TIME_TEMP,
                             -END_TIME_TEMP,
                             -JOIN_ID)

The two separate datasets are again joined. The columns are sorted by the *STATE__* and the BGN_DATETIME variable.

df_data_tidy <- bind_rows(df_data_old_format, df_data_new_format)

#Move the two last columns to the second position
col_order <- append(head(names(df_data_tidy), n = ncol(df_data_tidy) - 2),tail(names(df_data_tidy), n = 2), after = 1)
df_data_tidy <- df_data_tidy[,col_order] %>% arrange(STATE__,BGN_DATETIME)

Preprocessing damage data.

According to the data handbook the DMGEXP columns for the crop and the property damages contain indentifiers of the dimension of the damages. For many older data entries this information was not available but the respective columns contained a numeric. It is further assumed that the numeric should be used as a factor. A factor is computed based on the the DMGEXP column that is used to scale the values in the DMG column accordingly. The results are stored in the new columns PROPDMGTIDY, CROPDMGTIDY and TOTALDMGTIDY.

df_data_tidy <- df_data_tidy %>%
                    mutate(CROPDMGEXP = as.character(CROPDMGEXP),
                           PROPDMGEXP = as.character(CROPDMGEXP)) %>%
                    mutate(crop_factor = case_when(CROPDMGEXP == "K" | CROPDMGEXP == "k" ~  1000,
                                                   CROPDMGEXP == "M" | CROPDMGEXP == "m" ~  1000000,
                                                   CROPDMGEXP == "B" | CROPDMGEXP == "b" ~  1000000000,
                                                   TRUE ~ as.numeric(CROPDMGEXP)),
                           prop_factor = case_when(PROPDMGEXP == "K" | PROPDMGEXP == "k" ~  1000,
                                                   PROPDMGEXP == "M" | PROPDMGEXP == "m" ~  1000000,
                                                   PROPDMGEXP == "B" | PROPDMGEXP == "b" ~  1000000000,
                                                   TRUE ~ as.numeric(CROPDMGEXP))) %>%
                    mutate(PROPDMGTIDY = as.numeric(PROPDMG * prop_factor),
                           CROPDMGTIDY = as.numeric(CROPDMG * crop_factor)) %>%
                    mutate(TOTALDMGTIDY = PROPDMGTIDY + CROPDMGTIDY) %>%
                    select(-crop_factor, -prop_factor)

## Warning in eval_bare(f[[3]], env): NAs durch Umwandlung erzeugt

## Warning in eval_bare(f[[3]], env): NAs durch Umwandlung erzeugt

Any values that are not covered by the case statement are replaced with NA.

Data analysis

Events most harmful to human health

The total number of injuries and fatalities is computed for all event classes. The severity of the event class is highlighted by the place column. It was computed by only taking into account the lives-lost metric thus not considering any injuries.

df_fatality <-  df_data_tidy %>%  select(EVTYPE,FATALITIES, INJURIES) %>% group_by(EVTYPE) %>% summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>% arrange(desc(Fatalities, Injuries))

#add the fatalities
df_fatality <- df_fatality %>% mutate(place = c(1:nrow(df_fatality)))

Events with the greatest economic impact?

For this analysis the order of economic impact was determined by the property damages alone as property damages have a stronger financial impact than just the destruction of property in the long run. The average value for property and crop damages is computed and stored in a new data frame df_damages.

df_damages <-  df_data_tidy %>%  select(EVTYPE,PROPDMGTIDY, CROPDMGTIDY) %>% group_by(EVTYPE) %>% summarize(PropertyDamages = mean(PROPDMGTIDY), CropDamages = mean(CROPDMGTIDY)) %>% arrange(desc(PropertyDamages, CropDamages))

#add the fatalities
df_damages <- df_damages %>% mutate(place = c(1:nrow(df_damages)))

Compute the aggregated total damages and group by the state in which they occured. Save the result in a new dataframe df_damages_states.

df_damages_states <-  df_data_tidy %>%
                        select(STATE, TOTALDMGTIDY) %>%
                        filter(!is.na(TOTALDMGTIDY)) %>%
                        group_by(STATE) %>% 
                        summarize(average_total_damages = mean(TOTALDMGTIDY)) %>%
                        inner_join(states, by = c("STATE" = "state_abbv"))

## Warning: Column `STATE`/`state_abbv` joining factor and character vector,
## coercing into character vector

Results

Question 1: Events most harmful to human health

The plot below provides an overview of the most harmful event types. A logarithmic scaling has been applied to the y-axis to improve readability. The plot shows that the 5 most harmful events to human health across the United States are Tornados, Excessive Heat, Flash Floods, Heat and Lightning.

#let's focus on the top 20
df_sub <- head(df_fatality, n = 20)

#change from wide to long format using dplyrs gather function

df_sub_long <- gather(df_sub, type, count, Fatalities:Injuries, factor_key = TRUE)
labels <-  append(df_sub_long[1:20,"EVTYPE"],rep("",20))
g <- ggplot(data = df_sub_long, aes(x=place, y = count, label = append(as.character(EVTYPE[1:20]),rep("",20)), fill = type))

g+
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(y = 1) ,angle = 90, size = 2.5, hjust = -0.05) +
  scale_y_continuous(trans='log10') +
  xlab("Event type") +
  ylab("Total number of Fatalities/Injuries") +
  ggtitle("Total number of Fatalities/Injuries by event type")

A summary of the data is provided by the table below:

df_fatality

## # A tibble: 985 x 4
##    EVTYPE         Fatalities Injuries place
##    <fct>               <dbl>    <dbl> <int>
##  1 TORNADO              5633    91346     1
##  2 EXCESSIVE HEAT       1903     6525     2
##  3 FLASH FLOOD           978     1777     3
##  4 HEAT                  937     2100     4
##  5 LIGHTNING             816     5230     5
##  6 TSTM WIND             504     6957     6
##  7 FLOOD                 470     6789     7
##  8 RIP CURRENT           368      232     8
##  9 HIGH WIND             248     1137     9
## 10 AVALANCHE             224      170    10
## # ... with 975 more rows

Question 2: Events with the greatest economic impact

An overview of the results is provided in the table below: As for the analysis method explained above the greatest economic impact is caused by the events Winterstorms, Heavy Rain, High Winds, Tornados and Tropical Storms.

df_damages

## # A tibble: 985 x 4
##    EVTYPE                     PropertyDamages CropDamages place
##    <fct>                                <dbl>       <dbl> <int>
##  1 WINTER STORM HIGH WINDS           60000000     5000000     1
##  2 Heavy Rain/High Surf              13500000     1500000     2
##  3 HIGH WINDS/COLD                   10112000     1400000     3
##  4 TORNADOES, TSTM WIND, HAIL         1600000     2500000     4
##  5 TROPICAL STORM GORDON               500000      500000     5
##  6 HAIL/WINDS                          250000       25025     6
##  7 FLASH FLOODING/FLOOD                218750       21875     7
##  8 HEAT WAVE DROUGHT                   200000       50000     8
##  9 HURRICANE OPAL/HIGH WINDS           100000    10000000     9
## 10 DUST STORM/HIGH WINDS                50000      500000    10
## # ... with 975 more rows

For certain event types like for examle the Astronomical Low Tide no crop damage data is available even though properties have been destroyed. This as well as the fact that other event types like High Winds cause far greater damages to crops than to other properties indicates that the analysis should be reviewed. The data is visualized in the plot below. The plot depicts the most harmful average crop and property damages.

#let's focus on the top 20
df_sub <- head(df_damages, n = 20)

#change from wide to long format using dplyrs gather function

df_sub_long <- gather(df_sub, type, count, PropertyDamages:CropDamages, factor_key = TRUE)
labels <-  append(df_sub_long[1:20,"EVTYPE"],rep("",20))
g <- ggplot(data = df_sub_long, aes(x=place, y = count, label = append(as.character(EVTYPE[1:20]),rep("",20)), fill = type))

g+
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(y = 1) ,angle = 90, size = 2.5, hjust = -0.05) +
  scale_y_continuous(trans='log10')

## Warning: Transformation introduced infinite values in continuous y-axis

To further address the question of greates economic impact the total damages caused by the events are aggregated by the state in which they occured.

The result below shows that one state in particular was affected much stronger than all other states: The state of Mississippi accounted for an order of magnitude higher total damages than all others states. A potential explanation for this might be the Hurricane Katrina of 2005.

df_damages_states %>%

  ggplot(aes(long, lat, group = group, fill = log10(average_total_damages))) +

  geom_polygon(color = NA) +

  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +

  labs(fill = "Log10 of average Damages") +
  xlab("Longitude") +
  ylab("Latitude")

Course Project 2 - The impact of storms

EPVogel

7 November 2018