In this analysis the following two questions will be answered:
The data is preprocessed to acquire a tidy dataset. Then the resulting dataset is used to compute summarized quantities as e.g. the total number of casualties caused by a certain event type. The quantities are visualized to answer the two questions from above
The following packages are used for the dataprocessing. A special package is the the urbnmapr package that is meant to simplify the creation of state and county choropleth maps. It can be installed using devtools devtools::install_github(“UrbanInstitute/urbnmapr”).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
library(tidyr)
#this library below is from github and is used for visualizing states data
library(urbnmapr)
The data is downloaded if no file is found in the current working directory.
link <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename = "storm_data.csv.bz2"
if(!file.exists(filename)){
download.file(link, filename)
}
The data is is loaded into R with the read.csv function and stored in a tidyr table.
#read the data
df_data <- read.csv(filename, header = TRUE, sep = ",") %>%
tbl_df
The data sample in heterogeneous regarding the measurements of many variables such as the date and time variable as well as the damage variables.
The time and date column are converted into a single date time column:
Split the dataset into two parts, coerce date and time columns into characters.
df_data <- df_data %>% mutate(BGN_DATE = as.character(BGN_DATE),
BGN_TIME = as.character(BGN_TIME),
END_DATE = as.character(END_DATE),
END_TIME = as.character(END_TIME))
#Split the dataset
is_new_format <- grepl("(AM)|(PM)", df_data$BGN_TIME)
df_data_old_format <- subset(df_data, !is_new_format)
df_data_new_format <- subset(df_data, is_new_format)
Time transformation for the old formatted part is applied. Unfortunately make_datetime does not support the use of ifelse. This is why additional variables have to be specified. The date is parsed using the mdy_hms method from the lubridate package.
#Apply the format conversions to the subsets
df_data_old_format_raw <- df_data_old_format %>%
mutate(BGN_DATETIME = mdy_hms(BGN_DATE),
BGN_TIME_TEMP = parse_date_time(BGN_TIME, "%H%M"),
JOIN_ID = c(1:nrow(df_data_old_format))) %>%
mutate(GOOD_START = !is.na(BGN_TIME_TEMP))
## Warning: 5 failed to parse.
#Set the begin time
df_data_old_format_BGN <- df_data_old_format_raw %>%
filter(GOOD_START) %>%
mutate(BGN_DATETIME2 = make_datetime(
year = year(BGN_DATETIME),
month = month(BGN_DATETIME),
day = day(BGN_DATETIME),
hour = hour(BGN_TIME_TEMP),
min = minute(BGN_TIME_TEMP),
sec = second(BGN_TIME_TEMP))) %>%
select(BGN_DATETIME2, JOIN_ID)
#select the columns that matter
df_data_old_format <- left_join(df_data_old_format_raw, df_data_old_format_BGN, by = "JOIN_ID") %>%
mutate(BGN_DATETIME = BGN_DATETIME2) %>%
select(-BGN_DATETIME2, -GOOD_START, -BGN_TIME_TEMP, -JOIN_ID)
Not all entries are in a valid format (5 failed!). The elements that failed to parse are replaced with NA as no meaningful transformation can be applied.
A similar transformation is applied to the newer part of the dataset. Again the lubridate parsing functions are used:
#Apply the format conversions to the subsets
df_data_new_format_raw <- df_data_new_format %>%
mutate(BGN_DATETIME = mdy_hms(BGN_DATE),
END_DATETIME = mdy_hms(END_DATE),
BGN_TIME_TEMP = parse_date_time(BGN_TIME, "%H:%M:%S %p"),
END_TIME_TEMP = parse_date_time(END_TIME, "%H:%M:%S %p"),
JOIN_ID = c(1:nrow(df_data_new_format))) %>%
mutate(GOOD_START = !is.na(BGN_TIME_TEMP),
GOOD_END = !is.na(END_TIME_TEMP))
#Set the begin time
df_data_new_format_BGN <- df_data_new_format_raw %>%
filter(GOOD_START) %>%
mutate(BGN_DATETIME2 = make_datetime(
year = year(BGN_DATETIME),
month = month(BGN_DATETIME),
day = day(BGN_DATETIME),
hour = hour(BGN_TIME_TEMP),
min = minute(BGN_TIME_TEMP),
sec = second(BGN_TIME_TEMP))) %>%
select(BGN_DATETIME2, JOIN_ID)
#Set the end time
df_data_new_format_END <- df_data_new_format_raw %>%
filter(GOOD_END) %>%
mutate(END_DATETIME2 = make_datetime(
year = year(END_DATETIME),
month = month(END_DATETIME),
day = day(END_DATETIME),
hour = hour(END_TIME_TEMP),
min = minute(END_TIME_TEMP),
sec = second(END_TIME_TEMP))) %>%
select(END_DATETIME2, JOIN_ID)
df_data_new_format <- left_join(df_data_new_format_raw, df_data_new_format_BGN, by = "JOIN_ID") %>%
left_join(df_data_new_format_END, by = "JOIN_ID") %>%
mutate(END_DATETIME = END_DATETIME2,
BGN_DATETIME = BGN_DATETIME2) %>%
select(-BGN_DATETIME2,
-END_DATETIME2,
-GOOD_START,
-GOOD_END,
-BGN_TIME_TEMP,
-END_TIME_TEMP,
-JOIN_ID)
The two separate datasets are again joined. The columns are sorted by the *STATE__* and the BGN_DATETIME variable.
df_data_tidy <- bind_rows(df_data_old_format, df_data_new_format)
#Move the two last columns to the second position
col_order <- append(head(names(df_data_tidy), n = ncol(df_data_tidy) - 2),tail(names(df_data_tidy), n = 2), after = 1)
df_data_tidy <- df_data_tidy[,col_order] %>% arrange(STATE__,BGN_DATETIME)
According to the data handbook the DMGEXP columns for the crop and the property damages contain indentifiers of the dimension of the damages. For many older data entries this information was not available but the respective columns contained a numeric. It is further assumed that the numeric should be used as a factor. A factor is computed based on the the DMGEXP column that is used to scale the values in the DMG column accordingly. The results are stored in the new columns PROPDMGTIDY, CROPDMGTIDY and TOTALDMGTIDY.
df_data_tidy <- df_data_tidy %>%
mutate(CROPDMGEXP = as.character(CROPDMGEXP),
PROPDMGEXP = as.character(CROPDMGEXP)) %>%
mutate(crop_factor = case_when(CROPDMGEXP == "K" | CROPDMGEXP == "k" ~ 1000,
CROPDMGEXP == "M" | CROPDMGEXP == "m" ~ 1000000,
CROPDMGEXP == "B" | CROPDMGEXP == "b" ~ 1000000000,
TRUE ~ as.numeric(CROPDMGEXP)),
prop_factor = case_when(PROPDMGEXP == "K" | PROPDMGEXP == "k" ~ 1000,
PROPDMGEXP == "M" | PROPDMGEXP == "m" ~ 1000000,
PROPDMGEXP == "B" | PROPDMGEXP == "b" ~ 1000000000,
TRUE ~ as.numeric(CROPDMGEXP))) %>%
mutate(PROPDMGTIDY = as.numeric(PROPDMG * prop_factor),
CROPDMGTIDY = as.numeric(CROPDMG * crop_factor)) %>%
mutate(TOTALDMGTIDY = PROPDMGTIDY + CROPDMGTIDY) %>%
select(-crop_factor, -prop_factor)
## Warning in eval_bare(f[[3]], env): NAs durch Umwandlung erzeugt
## Warning in eval_bare(f[[3]], env): NAs durch Umwandlung erzeugt
Any values that are not covered by the case statement are replaced with NA.
The total number of injuries and fatalities is computed for all event classes. The severity of the event class is highlighted by the place column. It was computed by only taking into account the lives-lost metric thus not considering any injuries.
df_fatality <- df_data_tidy %>% select(EVTYPE,FATALITIES, INJURIES) %>% group_by(EVTYPE) %>% summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>% arrange(desc(Fatalities, Injuries))
#add the fatalities
df_fatality <- df_fatality %>% mutate(place = c(1:nrow(df_fatality)))
For this analysis the order of economic impact was determined by the property damages alone as property damages have a stronger financial impact than just the destruction of property in the long run. The average value for property and crop damages is computed and stored in a new data frame df_damages.
df_damages <- df_data_tidy %>% select(EVTYPE,PROPDMGTIDY, CROPDMGTIDY) %>% group_by(EVTYPE) %>% summarize(PropertyDamages = mean(PROPDMGTIDY), CropDamages = mean(CROPDMGTIDY)) %>% arrange(desc(PropertyDamages, CropDamages))
#add the fatalities
df_damages <- df_damages %>% mutate(place = c(1:nrow(df_damages)))
Compute the aggregated total damages and group by the state in which they occured. Save the result in a new dataframe df_damages_states.
df_damages_states <- df_data_tidy %>%
select(STATE, TOTALDMGTIDY) %>%
filter(!is.na(TOTALDMGTIDY)) %>%
group_by(STATE) %>%
summarize(average_total_damages = mean(TOTALDMGTIDY)) %>%
inner_join(states, by = c("STATE" = "state_abbv"))
## Warning: Column `STATE`/`state_abbv` joining factor and character vector,
## coercing into character vector
The plot below provides an overview of the most harmful event types. A logarithmic scaling has been applied to the y-axis to improve readability. The plot shows that the 5 most harmful events to human health across the United States are Tornados, Excessive Heat, Flash Floods, Heat and Lightning.
#let's focus on the top 20
df_sub <- head(df_fatality, n = 20)
#change from wide to long format using dplyrs gather function
df_sub_long <- gather(df_sub, type, count, Fatalities:Injuries, factor_key = TRUE)
labels <- append(df_sub_long[1:20,"EVTYPE"],rep("",20))
g <- ggplot(data = df_sub_long, aes(x=place, y = count, label = append(as.character(EVTYPE[1:20]),rep("",20)), fill = type))
g+
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(y = 1) ,angle = 90, size = 2.5, hjust = -0.05) +
scale_y_continuous(trans='log10') +
xlab("Event type") +
ylab("Total number of Fatalities/Injuries") +
ggtitle("Total number of Fatalities/Injuries by event type")
A summary of the data is provided by the table below:
df_fatality
## # A tibble: 985 x 4
## EVTYPE Fatalities Injuries place
## <fct> <dbl> <dbl> <int>
## 1 TORNADO 5633 91346 1
## 2 EXCESSIVE HEAT 1903 6525 2
## 3 FLASH FLOOD 978 1777 3
## 4 HEAT 937 2100 4
## 5 LIGHTNING 816 5230 5
## 6 TSTM WIND 504 6957 6
## 7 FLOOD 470 6789 7
## 8 RIP CURRENT 368 232 8
## 9 HIGH WIND 248 1137 9
## 10 AVALANCHE 224 170 10
## # ... with 975 more rows
An overview of the results is provided in the table below: As for the analysis method explained above the greatest economic impact is caused by the events Winterstorms, Heavy Rain, High Winds, Tornados and Tropical Storms.
df_damages
## # A tibble: 985 x 4
## EVTYPE PropertyDamages CropDamages place
## <fct> <dbl> <dbl> <int>
## 1 WINTER STORM HIGH WINDS 60000000 5000000 1
## 2 Heavy Rain/High Surf 13500000 1500000 2
## 3 HIGH WINDS/COLD 10112000 1400000 3
## 4 TORNADOES, TSTM WIND, HAIL 1600000 2500000 4
## 5 TROPICAL STORM GORDON 500000 500000 5
## 6 HAIL/WINDS 250000 25025 6
## 7 FLASH FLOODING/FLOOD 218750 21875 7
## 8 HEAT WAVE DROUGHT 200000 50000 8
## 9 HURRICANE OPAL/HIGH WINDS 100000 10000000 9
## 10 DUST STORM/HIGH WINDS 50000 500000 10
## # ... with 975 more rows
For certain event types like for examle the Astronomical Low Tide no crop damage data is available even though properties have been destroyed. This as well as the fact that other event types like High Winds cause far greater damages to crops than to other properties indicates that the analysis should be reviewed. The data is visualized in the plot below. The plot depicts the most harmful average crop and property damages.
#let's focus on the top 20
df_sub <- head(df_damages, n = 20)
#change from wide to long format using dplyrs gather function
df_sub_long <- gather(df_sub, type, count, PropertyDamages:CropDamages, factor_key = TRUE)
labels <- append(df_sub_long[1:20,"EVTYPE"],rep("",20))
g <- ggplot(data = df_sub_long, aes(x=place, y = count, label = append(as.character(EVTYPE[1:20]),rep("",20)), fill = type))
g+
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(y = 1) ,angle = 90, size = 2.5, hjust = -0.05) +
scale_y_continuous(trans='log10')
## Warning: Transformation introduced infinite values in continuous y-axis
To further address the question of greates economic impact the total damages caused by the events are aggregated by the state in which they occured.
The result below shows that one state in particular was affected much stronger than all other states: The state of Mississippi accounted for an order of magnitude higher total damages than all others states. A potential explanation for this might be the Hurricane Katrina of 2005.
df_damages_states %>%
ggplot(aes(long, lat, group = group, fill = log10(average_total_damages))) +
geom_polygon(color = NA) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
labs(fill = "Log10 of average Damages") +
xlab("Longitude") +
ylab("Latitude")