Transform and tidy 3 datasets; perform analysis on said datasets
(Albina Gallyavova)
Here we have some rate history on commercial paper, a short-term debt instrument. Our goal is to create a table that can summarize it by various dimensions while keeping some of the metadata present in the CSV file.
library(readr)
library(stringr)
library(tidyverse)
library(lubridate)
library(pander)
rates <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/rates.csv")
meta <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/rates.csv", n_max = 4)
rates <- rates[6:nrow(rates), ] #drop the metadata rows at the top; the actual observations start at row 6
cols <- ncol(meta)
meta_t <- meta %>%
  gather('col', 'desc', 2:cols) %>%
  spread('Series Description', 'desc')
# This sets up joining the metadata to the actual data
rates_t <- rates %>% gather('col', 'rate', 2:25)
colnames(rates_t) <- unlist(str_extract_all(unlist(str_replace(colnames(rates_t), ' ', '')), '[[:alpha:]]+'))
#remove the spaces and punctuation from the column names, rather than explicitly renaming them
colnames(meta_t) <- unlist(str_extract_all(unlist(str_replace(colnames(meta_t), ' ', '')), '[[:alpha:]]+'))
with_mult <- rates_t %>%
inner_join(meta_t, by = "col") %>%
select(
"UniqueIdentifier",
"SeriesDescription",
"col",
"rate",
"Multiplier"
) %>%
mutate(
Multiplier = as.numeric(Multiplier),
rate = as.numeric(rate),
rate = Multiplier*rate,
duration = str_extract(col, '^\\b[\\w-]+\\b'),
duration_num = str_extract(duration, '\\d+')
)
# This applies the multiplier (it's 1 with this set) and begins creating the columns for the grouping dimensions
#the word() function could be used instead (a quick equivalent is sketched below), but why not practice some regex
with_mult$duration_num <- ifelse(is.na(with_mult$duration_num), 1, with_mult$duration_num)
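#for reference, the word() version mentioned above would look something like this
#(a quick sketch only; the regex column is what the rest of the analysis uses)
with_mult %>%
mutate(duration_word = word(col, 1)) %>% #first space-delimited token, e.g. "Overnight" or "30-Day"
select(col, duration, duration_word) %>%
head()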
with_mult <- with_mult %>%
rename(trading_date = SeriesDescription, rate_desc = col) %>%
mutate(
duration_num = as.numeric(duration_num),
trading_date = as.Date(trading_date),
rating = str_extract(rate_desc, '\\s\\b[\\w-/]+\\b'),
type = word(rate_desc,3,3)
)
#numeric duration for ordering on the graphs below
with_mult <- with_mult %>%
arrange(duration, rating, type, trading_date) %>%
group_by(duration, rating, type) %>%
mutate(
rate_change = rate - lag(rate),
percent_change = (rate- lag(rate))/lag(rate)
)
# Change and percent change are very valuable in securities analysis
with_mult[!is.na(with_mult$rate), ] %>%
group_by(duration, trading_date) %>%
summarise(avg = mean(rate, na.rm = TRUE)) %>%
ggplot() + geom_line(aes(x = trading_date, y = avg, group = "identity", color = duration)) +
facet_wrap(~duration) +
labs(y = "Interest Rate", x = "Date", title = "Interest Rate by Tenor") +
scale_x_date(date_breaks = "6 months")
I haven't been keeping up with my financial news lately, but I'd say there may have been some rate hikes (see https://fred.stlouisfed.org/series/FEDFUNDS). The commercial paper rate, as one would expect, closely tracks the fed funds rate, and this is especially true for the shorter maturities.
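If we wanted to see that overlap directly, one option is to pull the fed funds series from FRED and overlay it. This is only a sketch: it assumes the fredgraph.csv endpoint serves a two-column file with columns named DATE and FEDFUNDS.
#sketch: the fredgraph.csv endpoint and the DATE/FEDFUNDS column names are assumptions, not verified here
fedfunds <- read_csv("https://fred.stlouisfed.org/graph/fredgraph.csv?id=FEDFUNDS")
with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration, trading_date) %>%
summarise(avg = mean(rate, na.rm = TRUE)) %>%
ggplot() +
geom_line(aes(x = trading_date, y = avg, color = duration)) +
geom_line(data = filter(fedfunds, DATE >= min(with_mult$trading_date, na.rm = TRUE)), aes(x = DATE, y = FEDFUNDS), linetype = "dashed") +
labs(y = "Interest Rate", x = "Date", title = "Commercial Paper Rates vs. Effective Fed Funds Rate")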
with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration, duration_num, type, trading_date) %>%
summarise(avg = mean(rate, na.rm = TRUE)) %>%
ggplot() + geom_line(aes(x = trading_date, y = avg, group = "identity", color = duration)) +
facet_grid(duration_num~type) +
labs(y = "Interest Rate", x = "Date", title = "Interest Rate by Type and Tenor") +
scale_x_date(date_breaks = "6 months")
with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration_num,duration, type, trading_date) %>%
summarise(avg = mean(percent_change, na.rm = TRUE)) %>%
ggplot() + geom_line(aes(x = trading_date, y = avg, group = "identity", color = duration)) +
facet_grid(duration_num~type) +
labs(y = "Percent Change", x = "Date", title = "Percentage Change by Type and Tenor") +
scale_x_date(date_breaks = "6 months")
risk <- with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration_num, duration, type, rating) %>%
summarise(Stdev = sd(rate))
pander(as.data.frame(risk))
| duration_num | duration | type | rating | Stdev |
|---|---|---|---|---|
| 1 | Overnight | Asset-backed | AA | 0.276 |
| 1 | Overnight | Financial | AA | 0.284 |
| 1 | Overnight | Nonfinancial | A2/P2 | 0.2745 |
| 1 | Overnight | Nonfinancial | AA | 0.2775 |
| 7 | 7-Day | Asset-backed | AA | 0.2705 |
| 7 | 7-Day | Financial | AA | 0.2786 |
| 7 | 7-Day | Nonfinancial | A2/P2 | 0.2615 |
| 7 | 7-Day | Nonfinancial | AA | 0.272 |
| 15 | 15-Day | Asset-backed | AA | 0.2618 |
| 15 | 15-Day | Financial | AA | 0.2713 |
| 15 | 15-Day | Nonfinancial | A2/P2 | 0.2512 |
| 15 | 15-Day | Nonfinancial | AA | 0.2725 |
| 30 | 30-Day | Asset-backed | AA | 0.2378 |
| 30 | 30-Day | Financial | AA | 0.2653 |
| 30 | 30-Day | Nonfinancial | A2/P2 | 0.2287 |
| 30 | 30-Day | Nonfinancial | AA | 0.2568 |
| 60 | 60-Day | Asset-backed | AA | 0.2013 |
| 60 | 60-Day | Financial | AA | 0.2287 |
| 60 | 60-Day | Nonfinancial | A2/P2 | 0.2271 |
| 60 | 60-Day | Nonfinancial | AA | 0.2341 |
| 90 | 90-Day | Asset-backed | AA | 0.1528 |
| 90 | 90-Day | Financial | AA | 0.1951 |
| 90 | 90-Day | Nonfinancial | A2/P2 | 0.2427 |
| 90 | 90-Day | Nonfinancial | AA | 0.2196 |
The breaks in the lines likely indicate that these types are less liquid. Across every type, the overnight rate looks almost like a step function. The standard deviation of the rates doesn't yield what I expected, which would be for the longer maturities to be riskier; they should have higher interest rate risk because more is unknown about future rates. Perhaps a different risk measure would have been better.
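One possibility, sketched here with the rate_change column we already built, is the volatility of day-over-day changes rather than of the rate level itself, which isolates movement instead of drift:
risk_chg <- with_mult %>%
filter(!is.na(rate_change)) %>%
group_by(duration_num, duration, type, rating) %>%
summarise(change_sd = sd(rate_change)) %>%
arrange(duration_num)
pander(as.data.frame(risk_chg))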
David Quarshie
Here we have some accident data in a fairly well-structured form. I believe this can be made into two datasets: one relating every accident to every vehicle involved, and the original dataset with the vehicle columns removed. That gives an additional level of detail.
Here is where we take the vehicle information from the full dataset to be separated out; some transposing will be required. The assumption is that contributing_factor_vehicle_n matches up to vehicle_type_code_n (a quick check of that assumption follows the column selection below).
accidents_full <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/accidents.csv")
vehicles_start <- accidents_full %>%
select(
c(contributing_factor_vehicle_1:contributing_factor_vehicle_5,
vehicle_type_code1: vehicle_type_code_5,
unique_key
))
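As a quick check on that pairing assumption, we can count rows where a contributing factor appears without a matching vehicle type, and vice versa. Only the first and fifth slots are checked here, since those are the column names visible in the selection above.
vehicles_start %>%
summarise(
factor1_no_type = sum(!is.na(contributing_factor_vehicle_1) & is.na(vehicle_type_code1)),
type1_no_factor = sum(is.na(contributing_factor_vehicle_1) & !is.na(vehicle_type_code1)),
factor5_no_type = sum(!is.na(contributing_factor_vehicle_5) & is.na(vehicle_type_code_5)),
type5_no_factor = sum(is.na(contributing_factor_vehicle_5) & !is.na(vehicle_type_code_5))
)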
vehicles_t <- vehicles_start %>%
gather(
'vehicle_num',
'value',
c(contributing_factor_vehicle_1:contributing_factor_vehicle_5, vehicle_type_code1:vehicle_type_code_5)
)
#this will make 10 rows per 1 original row.
pander(head(as.data.frame(vehicles_t),10))
| unique_key | vehicle_num | value |
|---|---|---|
| 3760103 | contributing_factor_vehicle_1 | Driver Inattention/Distraction |
| 3760411 | contributing_factor_vehicle_1 | Unspecified |
| 3760254 | contributing_factor_vehicle_1 | Unsafe Lane Changing |
| 3761088 | contributing_factor_vehicle_1 | Following Too Closely |
| 3760494 | contributing_factor_vehicle_1 | Other Vehicular |
| 3760409 | contributing_factor_vehicle_1 | Driver Inattention/Distraction |
| 3761933 | contributing_factor_vehicle_1 | Unspecified |
| 3761273 | contributing_factor_vehicle_1 | Driver Inattention/Distraction |
| 3760923 | contributing_factor_vehicle_1 | Failure to Yield Right-of-Way |
| 3760699 | contributing_factor_vehicle_1 | Pedestrian/Bicyclist/Other Pedestrian Error/Confusion |
vehicles_t <- vehicles_t %>%
mutate(
type = str_extract(vehicle_num, '^[[:alpha:]]+_[[:alpha:]]+'),
vehicle_num = str_extract(vehicle_num, '\\d+')
) %>%
arrange(unique_key, vehicle_num, type) %>%
spread( key = type, value = value)
#This creates a vehicle_num column and a type column (contributing factor vs. vehicle type), which sets up the spread() call
vehicles_t$contributing_factor <- ifelse(vehicles_t$contributing_factor == "Unspecified", NA, vehicles_t$contributing_factor)
vehicles <- vehicles_t %>%
filter(vehicle_num == 1 | !(is.na(contributing_factor) & is.na(vehicle_type)))
#We keep the first vehicle for every accident, plus any other row where at least one of the two values is not NA
head(vehicles_t, 10)
accidents <- accidents_full %>%
select (
-c(contributing_factor_vehicle_1:contributing_factor_vehicle_5,
vehicle_type_code1: vehicle_type_code_5)
)
#The columns we used can be removed from the accidents dataset
head(accidents)
vehicles[!is.na(vehicles$vehicle_type),] %>%
group_by(vehicle_type) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ggplot() + geom_bar(aes(x = vehicle_type, y = count, fill = vehicle_type), stat = 'identity') +
coord_flip() +
labs(y = "Accidents", x = "Vehicle Type", title = "Accidents by Vehicle Type")
vehicles[!is.na(vehicles$contributing_factor),] %>%
group_by(contributing_factor) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(min_rank(desc(count)) <= 10)
vehicles[!is.na(vehicles$vehicle_type),] %>%
filter(!is.na(contributing_factor)) %>%
group_by(vehicle_type, contributing_factor) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(min_rank(desc(count)) <= 10)
This is fine, but it would be nice if we knew the relative driving time of these vehicle types; this is the actuarial concept known as exposure.
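To make the idea concrete, here is a sketch of an exposure-adjusted view. The exposure table is entirely made up for illustration (both the mileage figures and the vehicle type labels other than the SUV label used below), so treat it as a template rather than a result.
#hypothetical exposure figures (millions of miles driven); purely illustrative
exposure <- tribble(
~vehicle_type, ~miles_millions,
"PASSENGER VEHICLE", 900,
"SPORT UTILITY / STATION WAGON", 650,
"TAXI", 120
)
vehicles %>%
filter(!is.na(vehicle_type)) %>%
group_by(vehicle_type) %>%
summarise(accidents = n()) %>%
inner_join(exposure, by = "vehicle_type") %>%
mutate(accidents_per_million_miles = accidents/miles_millions)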
SUVs are a really popular vehicle type that seems safe. With our new data model, we can find out whether they actually are. This will add a flag to every accident involving an SUV.
suv_ids <- vehicles %>%
filter(vehicle_type == "SPORT UTILITY / STATION WAGON") %>%
distinct(unique_key)
with_suv <- accidents %>%
semi_join(suv_ids, by = "unique_key") %>% mutate(has_suv = 1)
no_suv <- accidents %>%
anti_join(suv_ids, by = "unique_key") %>%
mutate(has_suv = 0)
accidents_new <- dplyr::union(with_suv, no_suv)
accidents_new <- accidents_new %>%
mutate(
hour = hour(time)
)
select(accidents, borough) %>%
distinct()
sum(is.na(accidents_new$borough))
## [1] 424
accidents_new[!is.na(accidents_new$borough),] %>%
ggplot(aes(x = borough, fill = borough)) + geom_bar()
accidents_new %>%
ggplot() + geom_bar(aes(x = hour), fill = "darkgreen")
accidents_new %>%
group_by(has_suv) %>%
summarise(
ratio_killed = sum(number_of_persons_killed)/n(),
ratio_injured = sum(number_of_persons_injured)/n()
)
It looks like SUVs might be slightly safer. As with the analysis above, exposures would be really nice here. How much more congested is the Bronx than Staten Island? There are more accidents at peak times and around midnight, but are there more accidents per exposure? The uptick at midnight is interesting; as a Bostonian, I would say it's when the bars close, but New York is the city that never sleeps!
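A borough-level version of the exposure idea, sketched with rough population figures as a crude stand-in for driving exposure (the populations are approximate and the upper-case borough labels are assumed to match the values in the data):
#approximate populations, used only as a rough exposure proxy
borough_pop <- tribble(
~borough, ~population,
"BRONX", 1400000,
"BROOKLYN", 2600000,
"MANHATTAN", 1600000,
"QUEENS", 2300000,
"STATEN ISLAND", 480000
)
accidents_new %>%
filter(!is.na(borough)) %>%
group_by(borough) %>%
summarise(accidents = n()) %>%
inner_join(borough_pop, by = "borough") %>%
mutate(accidents_per_100k = 100000*accidents/population)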
Natalie Mollaghan
This dataset is less structured than the previous two, which is a nice change of pace. It has a couple of freeform text fields that must be parsed.
The date field is a character and has some irregularities; to maximize the possibility of using time as a dimension, we break it up part by part.
sharks <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/sharks.csv")
sharks$`Case Number` <- ifelse(sharks$`Case Number` == '0', NA, sharks$`Case Number`)
sharks <- sharks[!is.na(sharks$`Case Number`), 1:22]
sharks_new <- sharks %>%
mutate(
Date = str_extract(Date, '\\d*?[-]*?([[:alpha:]]{3})?[-]*?\\d{4}'),
year = as.numeric(str_extract(Date, '\\d{4}')),
month = str_extract(Date, '[[:alpha:]]{3}'),
month_int = match(month, month.abb),
day = as.numeric(str_extract(Date, '\\d{2}')),
Time = tolower(str_replace_all(Time,'\\s', ''))
)
sharks_new$Time <- ifelse(str_detect(sharks_new$Time, 'morning'), '8h00', sharks_new$Time)
sharks_new$Time <- ifelse(str_detect(sharks_new$Time, 'night'), '18h00', sharks_new$Time)
sharks_new$Time <- ifelse(str_detect(sharks_new$Time, 'afternoon'), '13h00', sharks_new$Time)
#Something is (usually) better than nothing. This will create spikes around those hours, but I believe it's worth it
sharks_new <- sharks_new %>% mutate(Time = str_extract(Time, '\\d{1,2}h\\d{2}'))
sharks_new <- sharks_new %>% mutate(species_temp = tolower(str_replace_all(Species, '[^[:alpha:] ]', ''))) %>%
mutate(
type = str_extract(species_temp, '[[:alpha:]]+ shark'), #assume the word before shark is the species
hour = as.numeric(word(Time, 1, 1, sep = fixed("h"))),
minute = as.numeric(word(Time, 2, 2 ,sep = fixed("h"))),
datetime = make_datetime(
year,
ifelse(is.na(month_int), 1, month_int),
ifelse(is.na(day), 1, day),
ifelse(is.na(hour), 0, hour),
ifelse(is.na(minute), 0, minute)
) #default missing month/day to 1 and missing time-of-day to midnight so the stamp isn't NA
)
Included in the species field are the actual species, length, and weight of the sharks, so there is a lot of info we can extract from it. We also standardize the area and activity fields.
sharks_new <- sharks_new %>%
mutate(
species_temp = tolower(str_replace_all(Species, "'", 'f')), #replace the foot mark (') with 'f' so feet parse like the other units
species_temp = tolower(str_replace_all(species_temp, '[^[:alnum:]\\.]', '')),
length_temp = str_extract(species_temp, '\\d+(\\.\\d+)?[a-z](to\\d+(\\.\\d+)?[a-z])?'), #optionals to allow for decimals and ranges
length_lower = str_extract(length_temp, '\\d+(\\.\\d+)?[a-z]'),
length_upper_temp = str_sub(length_temp, str_locate(length_temp, 'to')[,2] + 1),
length_upper = str_extract(length_upper_temp, '\\d+(\\.\\d+)?[a-z]')
)
convert <- function(x){
  #convert a length like "4m", "13f", or "18i" to meters
  if (!is.na(str_extract(x, 'm'))){
    l = as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #already in meters
    return(l)
  }else if (!is.na(str_extract(x, 'f'))){
    l = .3048*as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #feet to meters
    return(l)
  }else if(!is.na(str_extract(x, 'i'))){
    l = .0254*as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #inches to meters
    return(l)
  }else{
    return(NA)
  }
}
sharks_new$lower_meters <- unlist(sapply(sharks_new$length_lower, convert))
sharks_new$upper_meters <- unlist(sapply(sharks_new$length_upper, convert))
sharks_new$length_meters <- rowMeans(cbind(sharks_new$lower_meters, sharks_new$upper_meters), na.rm = TRUE)
#almost the same as above, but for weight measurements
sharks_new <- sharks_new %>%
mutate(
species_temp = tolower(str_replace_all(Species, "'", 'f')),
species_temp = tolower(str_replace_all(species_temp, '[^[:alnum:]]', '')),
weight_temp = str_extract(species_temp, '\\d+(\\.\\d+)?[kl]{1}[a-z]+(to\\d+(\\.\\d+)?[a-z])?'),
weight_lower = str_extract(weight_temp, '\\d+(\\.\\d+)?[a-z]'),
weight_upper_temp = str_sub(weight_temp, str_locate(weight_temp, 'to')[,2] + 1),
weight_upper = str_extract(weight_upper_temp, '\\d+(\\.\\d+)?[a-z]')
)
convert <- function(x){
  #convert a weight like "90k" (kg) or "200l" (lbs) to kilograms;
  #this overwrites the length converter above, which has already been applied
  if (!is.na(str_extract(x, 'k'))){
    l = as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #already in kilograms
    return(l)
  }else if (!is.na(str_extract(x, 'l'))){
    l = .453592*as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #pounds to kilograms
    return(l)
  }else{
    return(NA)
  }
}
sharks_new$lower_kg <- unlist(sapply(sharks_new$weight_lower, convert))
sharks_new$upper_kg <- unlist(sapply(sharks_new$weight_upper, convert))
sharks_new$weight_kg <- rowMeans(cbind(sharks_new$upper_kg, sharks_new$lower_kg), na.rm = TRUE)
sharks_new <- sharks_new %>%
mutate(
verb = tolower(str_extract(Activity,'[[:alpha:]]+ing')), #assume every activity ends in "ing"
area_stand = tolower(str_extract(Area, '[[:alpha:]-\\. ]+'))
)
sharks_new %>%
select(Species, type, length_meters, weight_kg) %>%
head()
sharks_new$length_meters <- ifelse(sharks_new$length_meters > 20, NA, sharks_new$length_meters)
#the method we used failed when measurements were in mm. Rather than write some spaghetti code to deal with it, we remove those observations.
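#if we did want to keep those rows, a minimal sketch (defined here but not applied) would be to
#capture up to two unit letters, e.g. str_extract(species_temp, '\\d+(\\.\\d+)?[a-z]{1,2}'),
#and then check for 'mm' before 'm':
convert_len <- function(x){
if (is.na(x)) return(NA_real_)
n <- as.numeric(str_extract(x, '\\d+(\\.\\d+)?'))
if (str_detect(x, 'mm')) return(n/1000) #millimeters to meters
if (str_detect(x, 'm')) return(n) #meters
if (str_detect(x, 'f')) return(.3048*n) #feet to meters
if (str_detect(x, 'i')) return(.0254*n) #inches to meters
NA_real_
}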
sharks_new <- rename(sharks_new, fatal = `Fatal (Y/N)`)
sharks_new <- sharks_new %>% mutate(fatal = tolower(fatal)) %>%
mutate(fatal = str_extract(fatal, '[yn]'))
sharks_final <- sharks_new %>%
select(
-c(Year, length_temp:length_upper, weight_temp:weight_upper, species_temp) #we don't need these anymore
)
sharks_final$weight_kg <- ifelse(is.nan(sharks_final$weight_kg), NA, sharks_final$weight_kg)
sharks_final$length_meters <- ifelse(is.nan(sharks_final$length_meters), NA, sharks_final$length_meters)
head(sharks_final,10)
We compare the size of sharks in fatal attacks vs. those in nonfatal attacks, then find the most dangerous areas and activities.
sharks_new[!is.na(sharks_new$fatal),] %>%
ggplot() + geom_boxplot(mapping = aes(x = fatal, y = length_meters, color = fatal)) +
labs(title = "Comparison of Shark Sizes in Fatal Attacks", y = "Length (m)", x = "Was Fatal")
sharks_new %>%
group_by(area_stand) %>%
summarize(attacks = n()) %>%
arrange(desc(attacks)) %>%
filter(min_rank(desc(attacks)) <= 15 & !is.na(area_stand)) %>%
ggplot() + geom_bar(mapping = aes(x = area_stand, y = attacks, fill = area_stand), stat = "identity") +
coord_flip() + labs(x = "Area", title = "Top 15 Dangerous Areas")
sharks_new %>%
group_by(verb) %>%
summarize(attacks = n()) %>%
arrange(desc(attacks)) %>%
filter(min_rank(desc(attacks)) <= 15 & !is.na(verb)) %>%
ggplot() + geom_bar(mapping = aes(x = verb, y = attacks, fill = verb), stat = "identity") +
coord_flip() + labs(x = "Activity", title = "Top 15 Dangerous Activities")
It looks as though bigger sharks are deadlier! I am also not swimming in Florida anytime soon!