Tidy Data - Project 2 Part 1 - Bank Stocks

Jeff Shamp

2020-03-08

Using tidyr, dplyr, ggplot2, stringr, RCurl

Tidy Data

I will be using the data source from data I posted since I never had to tidy it (thanks to python/pandas).

Load Data

b<-getURL(
"https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/banks.csv")
df_banks<-read.csv(text=b, stringsAsFactors = F)

Let’s look initial state of the data.

DT::datatable(df_banks[1:5,1:4], 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE))

Got our work cut out for us. The below function cycles through the columns and renames each selected column with a name that is located in one of the rows. Many multi-index or nested column structures will have a name buried in a row value. The function takes a dataframe df, the row index where the correct name is located row_idx, and the column where to start the name changing col_start. name_join is an option to keep the letters of a column name and remove the xxx.1 naming convention that R uses for duplicate column names.

rename.columns<-function(df, row_idx, col_start, name_join=F){
  for (i in col_start:ncol(df)){                                
    prefix<-c(str_remove(names(df[i]), "[.]\\d"), df[row_idx,i])
    join_name<-paste(prefix, collapse="_")  
    if (name_join == FALSE){
      names(df)[names(df) == names(df[i])]<-df[row_idx,i]
    } else {
      names(df)[names(df) == names(df[i])]<-join_name
    }
  }
  return(df)
}

Bank Stocks 2006

This is data regarding bank stocks from 2006 so it can give a nice window into how these banks were doing before the crash of 2007/08. In this data set there are only six banks listed BAC, C, GS, MS, JPM, and WFC. Each bank has the same varibles associated to them; open, high, low, close, and volume. There is also a date for each bank.

df_banks<-rename.columns(df_banks, 1, 2, name_join = TRUE)
df_banks<-df_banks %>%
  slice(3:dim(df_banks)[1]) %>%             # drop unnecessary lines
  rename(Date = Bank.Ticker) %>%
  pivot_longer(-Date, names_to = "bank",
               values_to = "value") %>%      # reshape banks to one column
  separate(bank, c("bank", "status"),
           sep = "_") %>%                    # separate status (open, close, etc.)
  mutate_at(.vars = vars(value),
            .funs = as.numeric) %>%          # Change data types
  mutate_at(.vars = vars(Date),
            .funs = as.Date)

DT::datatable(df_banks,
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE,
                         paging=TRUE,
                         fixedHeader=TRUE))

Analysis

The discussion piece talks about viewing the closing price as a function of time to see who fared better during that time period. The results are below.

df_banks %>%
  filter(status=="Close") %>%
  ggplot() +
  geom_line(aes(x=Date, y=value, col=bank)) +
  labs(x="Date", y="Closing Value", title = "Bank Stock Closing Value 2006 - 2016")

Brutal. Citi absolutely tanked. Let’s look between 01-01-2008 and 01-01-2010.

df_banks %>%
  filter(status=="Close") %>%
  filter(Date >="2008-01-01" & Date <= "2010-01-01") %>%
  ggplot() +
  geom_line(aes(x=Date, y=value, col=bank)) +
  labs(x="Date", y="Closing Value", 
       title = "Bank Stock Closing Value 2008 - 2010")

Interesting, by the time Citi hit rock bottom (and never regained!) Goldman Sacks had already recovered most of its pre-collapse value. For all the others listed this time period was not as catastrophic as it seemed. Next, let’s look at the aggregated data for median closing price in 6 month intervals. This will give us a more high level view of how these stocks fared during this time period. It may also give us less noise and more separation between banks for better comparison.

inner_join(inner_join(df_banks %>%        # Calculate a 6 month median
  filter(status=="Close") %>%
  filter(Date >='2008-01-01' & Date <='2008-07-01') %>%
  group_by(bank) %>%
  summarise('2008-07-01' = median(value)),
df_banks %>%
  filter(status=="Close") %>%
  filter(Date >='2008-07-01' & Date <='2008-12-31') %>%
  group_by(bank) %>%
  summarise('2008-12-31' = median(value)), by = "bank"),
inner_join(df_banks %>%
  filter(status=="Close") %>%
  filter(Date >='2009-01-01' & Date <='2009-07-01') %>%
  group_by(bank) %>%
  summarise('2009-07-01' = median(value)),
df_banks %>%
  filter(status=="Close") %>%
  filter(Date >='2009-07-01' & Date <='2009-12-31') %>%
  group_by(bank) %>%
  summarise('2009-12-31' = median(value)), by = "bank"), by = "bank") %>%
  
  pivot_longer(-bank,                    # Tidy the median prices
               names_to = "date",
               values_to = "closing_price") %>%
  mutate_at(.vars = vars(date),
            .funs = as.Date) %>%         # Cast the 6 month end as a date
  
  ggplot() + 
  geom_line(aes(x=date, y=closing_price, col=bank)) +
  labs(x="Date",
       y="Closing Value", 
       title = "Median Bank Stock Closing Value 2008 - 2010")

With some more separation, we can see that Bank of America actually got hit pretty hard. With a 6 month median of around $45 per share in the summer of 2008 to around $12 per share by the first day of 2010, that substantial. JPM, MS, and WFC were not largely affected in this time period according to median closing price. We also see again that Goldman Sacks completely recovered in median value in the two year interval. We see that yes Citi collapsed from around $230 per share to arond $30 per share in two years, which represents a 7x decline in value, but Citi went from 5x its competitors to being aligned with them in median stock price. One could make the arguement that Citi’s price collapse was more akin to a return to normalcy in price relative to peer banking institutions.