Transform and tidy 3 datasets; perform analysis on said datasets
(Albina Gallyavova)
Here we have some rate history on commercial paper, a short-term debt instrument. Our goal is to create a table that can summarize it by various dimensions while keeping some of the metadata present in the CSV file.
library(readr)
library(stringr)
library(tidyverse)
library(lubridate)
library(pander)
rates <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/rates.csv")
meta <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/rates.csv", n_max = 4)
rates <- rates[6:nrow(rates), ] #drop the metadata rows at the top; the actual observations start at row 6
cols <- ncol(meta)
meta_t <- meta %>%
  gather('col', 'desc', 2:cols) %>%
  spread('Series Description', 'desc')
# This sets up joining the metadata to the actual data
rates_t <- rates %>% gather('col', 'rate', 2:25)
colnames(rates_t) <- unlist(str_extract_all(unlist(str_replace(colnames(rates_t), ' ', '')), '[[:alpha:]]+'))
#remove the spaces and punctuation from the column names, rather than explicitly renaming them
colnames(meta_t) <- unlist(str_extract_all(unlist(str_replace(colnames(meta_t), ' ', '')), '[[:alpha:]]+'))
with_mult <- rates_t %>%
inner_join(meta_t, by = "col") %>%
select(
"UniqueIdentifier",
"SeriesDescription",
"col",
"rate",
"Multiplier"
) %>%
mutate(
Multiplier = as.numeric(Multiplier),
rate = as.numeric(rate),
rate = Multiplier*rate,
duration = str_extract(col, '^\\b[\\w-]+\\b'),
duration_num = str_extract(duration, '\\d+')
)
# This applies the multiplier (it's 1 with this set) and begins creating the columns for the grouping dimensions
#the word() function could be used instead (a quick equivalent is sketched below), but why not practice some regex
with_mult$duration_num <- ifelse(is.na(with_mult$duration_num), 1, with_mult$duration_num)
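#for reference, the word() version mentioned above would look something like this
#(a quick sketch only; the regex column is what the rest of the analysis uses)
with_mult %>%
mutate(duration_word = word(col, 1)) %>% #first space-delimited token, e.g. "Overnight" or "30-Day"
select(col, duration, duration_word) %>%
head()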
with_mult <- with_mult %>%
rename(trading_date = SeriesDescription, rate_desc = col) %>%
mutate(
duration_num = as.numeric(duration_num),
trading_date = as.Date(trading_date),
rating = str_extract(rate_desc, '\\s\\b[\\w-/]+\\b'),
type = word(rate_desc,3,3)
)
#numeric duration for ordering on the graphs below
with_mult <- with_mult %>%
arrange(duration, rating, type, trading_date) %>%
group_by(duration, rating, type) %>%
mutate(
rate_change = rate - lag(rate),
percent_change = (rate- lag(rate))/lag(rate)
)
# Change and percent change are very valuable in securities analysis
with_mult[!is.na(with_mult$rate), ] %>%
group_by(duration, trading_date) %>%
summarise(avg = mean(rate, na.rm = TRUE)) %>%
ggplot() + geom_line(aes(x = trading_date, y = avg, group = "identity", color = duration)) +
facet_wrap(~duration) +
labs(y = "Interest Rate", x = "Date", title = "Interest Rate by Tenor") +
scale_x_date(date_breaks = "6 months")
I haven't been keeping up with my financial news lately, but I'd say there may have been some rate hikes (see https://fred.stlouisfed.org/series/FEDFUNDS). The commercial paper rate, as one would expect, closely tracks the fed funds rate, and this is especially true for the shorter maturities.
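If we wanted to see that overlap directly, one option is to pull the fed funds series from FRED and overlay it. This is only a sketch: it assumes the fredgraph.csv endpoint serves a two-column file with columns named DATE and FEDFUNDS.
#sketch: the fredgraph.csv endpoint and the DATE/FEDFUNDS column names are assumptions, not verified here
fedfunds <- read_csv("https://fred.stlouisfed.org/graph/fredgraph.csv?id=FEDFUNDS")
with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration, trading_date) %>%
summarise(avg = mean(rate, na.rm = TRUE)) %>%
ggplot() +
geom_line(aes(x = trading_date, y = avg, color = duration)) +
geom_line(data = filter(fedfunds, DATE >= min(with_mult$trading_date, na.rm = TRUE)), aes(x = DATE, y = FEDFUNDS), linetype = "dashed") +
labs(y = "Interest Rate", x = "Date", title = "Commercial Paper Rates vs. Effective Fed Funds Rate")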
with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration, duration_num, type, trading_date) %>%
summarise(avg = mean(rate, na.rm = TRUE)) %>%
ggplot() + geom_line(aes(x = trading_date, y = avg, group = "identity", color = duration)) +
facet_grid(duration_num~type) +
labs(y = "Interest Rate", x = "Date", title = "Interest Rate by Type and Tenor") +
scale_x_date(date_breaks = "6 months")
with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration_num,duration, type, trading_date) %>%
summarise(avg = mean(percent_change, na.rm = TRUE)) %>%
ggplot() + geom_line(aes(x = trading_date, y = avg, group = "identity", color = duration)) +
facet_grid(duration_num~type) +
labs(y = "Percent Change", x = "Date", title = "Percentage Change by Type and Tenor") +
scale_x_date(date_breaks = "6 months")
risk <- with_mult %>%
filter(!is.na(rate)) %>%
group_by(duration_num, duration, type, rating) %>%
summarise(Stdev = sd(rate))
pander(as.data.frame(risk))
| duration_num | duration | type | rating | Stdev |
|---|---|---|---|---|
| 1 | Overnight | Asset-backed | AA | 0.276 |
| 1 | Overnight | Financial | AA | 0.284 |
| 1 | Overnight | Nonfinancial | A2/P2 | 0.2745 |
| 1 | Overnight | Nonfinancial | AA | 0.2775 |
| 7 | 7-Day | Asset-backed | AA | 0.2705 |
| 7 | 7-Day | Financial | AA | 0.2786 |
| 7 | 7-Day | Nonfinancial | A2/P2 | 0.2615 |
| 7 | 7-Day | Nonfinancial | AA | 0.272 |
| 15 | 15-Day | Asset-backed | AA | 0.2618 |
| 15 | 15-Day | Financial | AA | 0.2713 |
| 15 | 15-Day | Nonfinancial | A2/P2 | 0.2512 |
| 15 | 15-Day | Nonfinancial | AA | 0.2725 |
| 30 | 30-Day | Asset-backed | AA | 0.2378 |
| 30 | 30-Day | Financial | AA | 0.2653 |
| 30 | 30-Day | Nonfinancial | A2/P2 | 0.2287 |
| 30 | 30-Day | Nonfinancial | AA | 0.2568 |
| 60 | 60-Day | Asset-backed | AA | 0.2013 |
| 60 | 60-Day | Financial | AA | 0.2287 |
| 60 | 60-Day | Nonfinancial | A2/P2 | 0.2271 |
| 60 | 60-Day | Nonfinancial | AA | 0.2341 |
| 90 | 90-Day | Asset-backed | AA | 0.1528 |
| 90 | 90-Day | Financial | AA | 0.1951 |
| 90 | 90-Day | Nonfinancial | A2/P2 | 0.2427 |
| 90 | 90-Day | Nonfinancial | AA | 0.2196 |
The breaks in the lines likely indicate that these types are less liquid. Across every type, the overnight rate looks almost like a step function. The standard deviation of the rates doesn't yield what I expected, which would be for the longer maturities to be riskier; they should have higher interest rate risk because more is unknown about future rates. Perhaps a different risk measure would have been better.
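One possibility, sketched here with the rate_change column we already built, is the volatility of day-over-day changes rather than of the rate level itself, which isolates movement instead of drift:
risk_chg <- with_mult %>%
filter(!is.na(rate_change)) %>%
group_by(duration_num, duration, type, rating) %>%
summarise(change_sd = sd(rate_change)) %>%
arrange(duration_num)
pander(as.data.frame(risk_chg))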
David Quarshie
Here we have some accident data in a fairly well-structured form. I believe this can be made into two datasets: one relating every accident to every vehicle involved, and the original dataset with the vehicle columns removed. That gives an additional level of detail.
Here is where we take the vehicle information from the full dataset to be separated out; some transposing will be required. The assumption is that contributing_factor_vehicle_n matches up to vehicle_type_code_n (a quick check of that assumption follows the column selection below).
accidents_full <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/accidents.csv")
vehicles_start <- accidents_full %>%
select(
c(contributing_factor_vehicle_1:contributing_factor_vehicle_5,
vehicle_type_code1: vehicle_type_code_5,
unique_key
))
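As a quick check on that pairing assumption, we can count rows where a contributing factor appears without a matching vehicle type, and vice versa. Only the first and fifth slots are checked here, since those are the column names visible in the selection above.
vehicles_start %>%
summarise(
factor1_no_type = sum(!is.na(contributing_factor_vehicle_1) & is.na(vehicle_type_code1)),
type1_no_factor = sum(is.na(contributing_factor_vehicle_1) & !is.na(vehicle_type_code1)),
factor5_no_type = sum(!is.na(contributing_factor_vehicle_5) & is.na(vehicle_type_code_5)),
type5_no_factor = sum(is.na(contributing_factor_vehicle_5) & !is.na(vehicle_type_code_5))
)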
vehicles_t <- vehicles_start %>%
gather(
'vehicle_num',
'value',
c(contributing_factor_vehicle_1:contributing_factor_vehicle_5, vehicle_type_code1:vehicle_type_code_5)
)
#this will make 10 rows per 1 original row.
pander(head(as.data.frame(vehicles_t),10))
| unique_key | vehicle_num | value |
|---|---|---|
| 3760103 | contributing_factor_vehicle_1 | Driver Inattention/Distraction |
| 3760411 | contributing_factor_vehicle_1 | Unspecified |
| 3760254 | contributing_factor_vehicle_1 | Unsafe Lane Changing |
| 3761088 | contributing_factor_vehicle_1 | Following Too Closely |
| 3760494 | contributing_factor_vehicle_1 | Other Vehicular |
| 3760409 | contributing_factor_vehicle_1 | Driver Inattention/Distraction |
| 3761933 | contributing_factor_vehicle_1 | Unspecified |
| 3761273 | contributing_factor_vehicle_1 | Driver Inattention/Distraction |
| 3760923 | contributing_factor_vehicle_1 | Failure to Yield Right-of-Way |
| 3760699 | contributing_factor_vehicle_1 | Pedestrian/Bicyclist/Other Pedestrian Error/Confusion |
vehicles_t <- vehicles_t %>%
mutate(
type = str_extract(vehicle_num, '^[[:alpha:]]+_[[:alpha:]]+'),
vehicle_num = str_extract(vehicle_num, '\\d+')
) %>%
arrange(unique_key, vehicle_num, type) %>%
spread( key = type, value = value)
#This creates a vehicle_num column and a type column (contributing factor vs. vehicle type), which sets up the spread() call
vehicles_t$contributing_factor <- ifelse(vehicles_t$contributing_factor == "Unspecified", NA, vehicles_t$contributing_factor)
vehicles <- vehicles_t %>%
filter(vehicle_num == 1 | !(is.na(contributing_factor) & is.na(vehicle_type)))
#We keep the first vehicle for every accident, plus any other row where at least one of the two values is not NA
head(vehicles_t, 10)
accidents <- accidents_full %>%
select (
-c(contributing_factor_vehicle_1:contributing_factor_vehicle_5,
vehicle_type_code1: vehicle_type_code_5)
)
#The columns we used can be removed from the accidents dataset
head(accidents)
vehicles[!is.na(vehicles$vehicle_type),] %>%
group_by(vehicle_type) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ggplot() + geom_bar(aes(x = vehicle_type, y = count, fill = vehicle_type), stat = 'identity') +
coord_flip() +
labs(y = "Accidents", x = "Vehicle Type", title = "Accidents by Vehicle Type")
vehicles[!is.na(vehicles$contributing_factor),] %>%
group_by(contributing_factor) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(min_rank(desc(count)) <= 10)
vehicles[!is.na(vehicles$vehicle_type),] %>%
filter(!is.na(contributing_factor)) %>%
group_by(vehicle_type, contributing_factor) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(min_rank(desc(count)) <= 10)
This is fine, but it would be nice if we knew the relative driving time of these vehicle types; this is the actuarial concept known as exposure.
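To make the idea concrete, here is a sketch of an exposure-adjusted view. The exposure table is entirely made up for illustration (both the mileage figures and the vehicle type labels other than the SUV label used below), so treat it as a template rather than a result.
#hypothetical exposure figures (millions of miles driven); purely illustrative
exposure <- tribble(
~vehicle_type, ~miles_millions,
"PASSENGER VEHICLE", 900,
"SPORT UTILITY / STATION WAGON", 650,
"TAXI", 120
)
vehicles %>%
filter(!is.na(vehicle_type)) %>%
group_by(vehicle_type) %>%
summarise(accidents = n()) %>%
inner_join(exposure, by = "vehicle_type") %>%
mutate(accidents_per_million_miles = accidents/miles_millions)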
SUVs are a really popular vehicle type that seems safe. With our new data model, we can find out whether they actually are. This will add a flag to every accident involving an SUV.
suv_ids <- vehicles %>%
filter(vehicle_type == "SPORT UTILITY / STATION WAGON") %>%
distinct(unique_key)
with_suv <- accidents %>%
semi_join(suv_ids, by = "unique_key") %>% mutate(has_suv = 1)
no_suv <- accidents %>%
anti_join(suv_ids, by = "unique_key") %>%
mutate(has_suv = 0)
accidents_new <- dplyr::union(with_suv, no_suv)
accidents_new <- accidents_new %>%
mutate(
hour = hour(time)
)
select(accidents, borough) %>%
distinct()
sum(is.na(accidents_new$borough))
## [1] 424
accidents_new[!is.na(accidents_new$borough),] %>%
ggplot(aes(x = borough, fill = borough)) + geom_bar()
accidents_new %>%
ggplot() + geom_bar(aes(x = hour), fill = "darkgreen")
accidents_new %>%
group_by(has_suv) %>%
summarise(
ratio_killed = sum(number_of_persons_killed)/n(),
ratio_injured = sum(number_of_persons_injured)/n()
)
It looks like SUVs might be slightly safer. As with the analysis above, exposures would be really nice here. How much more congested is the Bronx than Staten Island? There are more accidents at peak times and around midnight, but are there more accidents per exposure? The uptick at midnight is interesting; as a Bostonian, I would say it's when the bars close, but New York is the city that never sleeps!
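A borough-level version of the exposure idea, sketched with rough population figures as a crude stand-in for driving exposure (the populations are approximate and the upper-case borough labels are assumed to match the values in the data):
#approximate populations, used only as a rough exposure proxy
borough_pop <- tribble(
~borough, ~population,
"BRONX", 1400000,
"BROOKLYN", 2600000,
"MANHATTAN", 1600000,
"QUEENS", 2300000,
"STATEN ISLAND", 480000
)
accidents_new %>%
filter(!is.na(borough)) %>%
group_by(borough) %>%
summarise(accidents = n()) %>%
inner_join(borough_pop, by = "borough") %>%
mutate(accidents_per_100k = 100000*accidents/population)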
Natalie Mollaghan
This dataset is less structured than the previous two, which is a nice change of pace. It has a couple of freeform text fields that must be parsed.
The date field is a character and has some irregularities; to maximize the possibility of using time as a dimension, we break it up part by part.
sharks <- read_csv("https://raw.githubusercontent.com/TheFedExpress/DATA607/master/sharks.csv")
sharks$`Case Number` <- ifelse(sharks$`Case Number` == '0', NA, sharks$`Case Number`)
sharks <- sharks[!is.na(sharks$`Case Number`), 1:22]
sharks_new <- sharks %>%
mutate(
Date = str_extract(Date, '\\d*?[-]*?([[:alpha:]]{3})?[-]*?\\d{4}'),
year = as.numeric(str_extract(Date, '\\d{4}')),
month = str_extract(Date, '[[:alpha:]]{3}'),
month_int = match(month, month.abb),
day = as.numeric(str_extract(Date, '\\d{2}')),
Time = tolower(str_replace_all(Time,'\\s', ''))
)
sharks_new$Time <- ifelse(str_detect(sharks_new$Time, 'morning'), '8h00', sharks_new$Time)
sharks_new$Time <- ifelse(str_detect(sharks_new$Time, 'night'), '18h00', sharks_new$Time)
sharks_new$Time <- ifelse(str_detect(sharks_new$Time, 'afternoon'), '13h00', sharks_new$Time)
#Something is (usually) better than nothing. This will create spikes around those hours, but I believe it's worth it
sharks_new <- sharks_new %>% mutate(Time = str_extract(Time, '\\d{1,2}h\\d{2}'))
sharks_new <- sharks_new %>% mutate(species_temp = tolower(str_replace_all(Species, '[^[:alpha:] ]', ''))) %>%
mutate(
type = str_extract(species_temp, '[[:alpha:]]+ shark'), #assume the word before shark is the species
hour = as.numeric(word(Time, 1, 1, sep = fixed("h"))),
minute = as.numeric(word(Time, 2, 2 ,sep = fixed("h"))),
datetime = make_datetime(
year,
ifelse(is.na(month_int), 1, month_int),
ifelse(is.na(day), 1, day),
ifelse(is.na(hour), 0, hour),
ifelse(is.na(minute), 0, minute)
) #default missing month/day to 1 and missing time-of-day to midnight so the stamp isn't NA
)
Included in the species field are the actual species, length, and weight of the sharks, so there is a lot of info we can extract from it. We also standardize the area and activity fields.
sharks_new <- sharks_new %>%
mutate(
species_temp = tolower(str_replace_all(Species, "'", 'f')), #replace the foot mark (') with 'f' so feet parse like the other units
species_temp = tolower(str_replace_all(species_temp, '[^[:alnum:]\\.]', '')),
length_temp = str_extract(species_temp, '\\d+(\\.\\d+)?[a-z](to\\d+(\\.\\d+)?[a-z])?'), #optionals to allow for decimals and ranges
length_lower = str_extract(length_temp, '\\d+(\\.\\d+)?[a-z]'),
length_upper_temp = str_sub(length_temp, str_locate(length_temp, 'to')[,2] + 1),
length_upper = str_extract(length_upper_temp, '\\d+(\\.\\d+)?[a-z]')
)
convert <- function(x){
  #convert a length like "4m", "13f", or "18i" to meters
  if (!is.na(str_extract(x, 'm'))){
    l = as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #already in meters
    return(l)
  }else if (!is.na(str_extract(x, 'f'))){
    l = .3048*as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #feet to meters
    return(l)
  }else if(!is.na(str_extract(x, 'i'))){
    l = .0254*as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #inches to meters
    return(l)
  }else{
    return(NA)
  }
}
sharks_new$lower_meters <- unlist(sapply(sharks_new$length_lower, convert))
sharks_new$upper_meters <- unlist(sapply(sharks_new$length_upper, convert))
sharks_new$length_meters <- rowMeans(cbind(sharks_new$lower_meters, sharks_new$upper_meters), na.rm = TRUE)
#almost the same as above, but for weight measurements
sharks_new <- sharks_new %>%
mutate(
species_temp = tolower(str_replace_all(Species, "'", 'f')),
species_temp = tolower(str_replace_all(species_temp, '[^[:alnum:]]', '')),
weight_temp = str_extract(species_temp, '\\d+(\\.\\d+)?[kl]{1}[a-z]+(to\\d+(\\.\\d+)?[a-z])?'),
weight_lower = str_extract(weight_temp, '\\d+(\\.\\d+)?[a-z]'),
weight_upper_temp = str_sub(weight_temp, str_locate(weight_temp, 'to')[,2] + 1),
weight_upper = str_extract(weight_upper_temp, '\\d+(\\.\\d+)?[a-z]')
)
convert <- function(x){
  #convert a weight like "90k" (kg) or "200l" (lbs) to kilograms;
  #this overwrites the length converter above, which has already been applied
  if (!is.na(str_extract(x, 'k'))){
    l = as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #already in kilograms
    return(l)
  }else if (!is.na(str_extract(x, 'l'))){
    l = .453592*as.numeric(str_extract(x,'\\d+(\\.\\d+)?')) #pounds to kilograms
    return(l)
  }else{
    return(NA)
  }
}
sharks_new$lower_kg <- unlist(sapply(sharks_new$weight_lower, convert))
sharks_new$upper_kg <- unlist(sapply(sharks_new$weight_upper, convert))
sharks_new$weight_kg <- rowMeans(cbind(sharks_new$upper_kg, sharks_new$lower_kg), na.rm = TRUE)
sharks_new <- sharks_new %>%
mutate(
verb = tolower(str_extract(Activity,'[[:alpha:]]+ing')), #assume every activity ends in "ing"
area_stand = tolower(str_extract(Area, '[[:alpha:]-\\. ]+'))
)
sharks_new %>%
select(Species, type, length_meters, weight_kg) %>%
head()
sharks_new$length_meters <- ifelse(sharks_new$length_meters > 20, NA, sharks_new$length_meters)
#the method we used failed when measurements were in mm. Rather than write some spaghetti code to deal with it, we remove those observations.
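#if we did want to keep those rows, a minimal sketch (defined here but not applied) would be to
#capture up to two unit letters, e.g. str_extract(species_temp, '\\d+(\\.\\d+)?[a-z]{1,2}'),
#and then check for 'mm' before 'm':
convert_len <- function(x){
if (is.na(x)) return(NA_real_)
n <- as.numeric(str_extract(x, '\\d+(\\.\\d+)?'))
if (str_detect(x, 'mm')) return(n/1000) #millimeters to meters
if (str_detect(x, 'm')) return(n) #meters
if (str_detect(x, 'f')) return(.3048*n) #feet to meters
if (str_detect(x, 'i')) return(.0254*n) #inches to meters
NA_real_
}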
sharks_new <- rename(sharks_new, fatal = `Fatal (Y/N)`)
sharks_new <- sharks_new %>% mutate(fatal = tolower(fatal)) %>%
mutate(fatal = str_extract(fatal, '[yn]'))
sharks_final <- sharks_new %>%
select(
-c(Year, length_temp:length_upper, weight_temp:weight_upper, species_temp) #we don't need these anymore
)
sharks_final$weight_kg <- ifelse(is.nan(sharks_final$weight_kg), NA, sharks_final$weight_kg)
sharks_final$length_meters <- ifelse(is.nan(sharks_final$length_meters), NA, sharks_final$length_meters)
head(sharks_final,10)
We compare the size of sharks in fatal attacks vs. those in nonfatal attacks, then find the most dangerous areas and activities.
sharks_new[!is.na(sharks_new$fatal),] %>%
ggplot() + geom_boxplot(mapping = aes(x = fatal, y = length_meters, color = fatal)) +
labs(title = "Comparison of Shark Sizes in Fatal Attacks", y = "Length (m)", x = "Was Fatal")
sharks_new %>%
group_by(area_stand) %>%
summarize(attacks = n()) %>%
arrange(desc(attacks)) %>%
filter(min_rank(desc(attacks)) <= 15 & !is.na(area_stand)) %>%
ggplot() + geom_bar(mapping = aes(x = area_stand, y = attacks, fill = area_stand), stat = "identity") +
coord_flip() + labs(x = "Area", title = "Top 15 Dangerous Areas")
sharks_new %>%
group_by(verb) %>%
summarize(attacks = n()) %>%
arrange(desc(attacks)) %>%
filter(min_rank(desc(attacks)) <= 15 & !is.na(verb)) %>%
ggplot() + geom_bar(mapping = aes(x = verb, y = attacks, fill = verb), stat = "identity") +
coord_flip() + labs(x = "Activity", title = "Top 15 Dangerous Activities")
It looks as though bigger sharks are deadlier! I am also not swimming in Florida anytime soon!