Overview

This project was created to form an ETL pipeline for an API. I accessed financial data API from link and built out a MYSQL database. R and R libraries were used as my interface with the API and the MYSQL database. All packages were listed in the first section Library Import. The report was created as an accompanying html document and will live in my GitHub, the report may be periodically updated.

I decided to explore US regional banks. I built a function to access the API and convert JSON data to a R Dataframe. This function also cleans the data and preps it for our SQL database. Another function was used to insert that R Dataframe into a MYSQL database. After the data was inserted into SQL, SQL queries were executed within R, and the query results were displayed in a datatable in R. Datatables are convenient, as they give users more flexibility than traditional print statements. SQL Windows functions were then used to manipulate the SQL data to explore aggregated behaviors. Some of those functions were generalized to handle any numeric column. Generally, the functions are there to show an understanding of SQL and how to perform data manipulation.

Build out functions

In the section below, the API is accessed and the JSON data is converted and inserted into a MYSQL database. After reading through all the subsections below, please see windows functions section to the right for more exploration.

Functions

json_to_df

  • Accesses our financial data API and returns a cleaned DF
  • Inputs
    • Tag(char)
      • Stock ticker

Build_table

  • Function takes the result of json_to_df, a dataframe with stock financial information, and runs a query to create an SQL table from that dataframe
  • Returns an error if table already exists, prompting user to rename table or run the update_table function
  • To prevent duplicate data, the function creates a primary key based on date+company tag
  • Inputs
    • Tag(char)
      • Stock ticker
    • Table_name(char)
      • User choice for MYSQL table name

Update table

  • Function loops over a list of stock market tags and inserts them into a previously created MYSQL database
  • Function has error handling so that it will refuse to add a dataframe from the loop if it detects duplicate data in that dataframe
    • If dataframe has duplicates an error will print to screen that “data for {tag} was not added” and the loop will move onto next item in list
    • If dataframe is inserted successfully, function will print to screen to indicate is was updated
  • Inputs
    • tags (list of chars)
    • table_name(char)
      • User choice for MYSQL table name
### Function accesses financial data API and returns a dataframe ready to be inserted into SQL 
### Function takes companies stock tag as an input string   
json_to_df <- function(tag)
    {
    ## Build url
    url <- paste("https://financialmodelingprep.com/api/v3/financials/income-statement/",tag,"?period=quarter",sep="")
    headers = c('Upgrade-Insecure-Requests' = '1')
    params = list(`datatype` = 'json')
    
    ## Make request
    result <- GET(url = url, httr::add_headers(.headers=headers), query = params)
    result<- rawToChar(result$content)
    df <- as.data.frame(fromJSON(result)[2])
    
    ## Fix table column names- remove financials. and special char "."
    colnames(df) <- gsub("financials.|\\.","",colnames(df))
    
    ## Convert Date to datetime/ rename date as reportdate
    df$Report_Date <- as.Date(df$date, '%Y-%m-%d')
    df <-  df %>%
        select(-date)
    
    ## Build tag column to identify stock ticker
    df$Company <- tag
    
    ## Build primary key column convert financial data to numeric
    df$Id <- paste(as.character(df$Report_Date),df$Company,sep="-")
    cols.num <- colnames(df)[1:31]
    df[,cols.num] <- sapply(df[cols.num],as.numeric)
    
    ## Data check for NA-
    table(is.na(df))
    return(df)
}


### Function initalizes table creation in mysql
### takes stock tag(chracter), tablename(character)
### If table already exists, will tell you to use different name or use update function instead
build_table <- function(tag,tablename){
    
        ## import df from json api call function 
        df <- json_to_df(tag)
    
        ## grab credentials from credential file
        db_credentials<-"C:\\Users\\justin\\Desktop\\xmedia.cnf"
        my_sql_db<-"xmedia"
        
        ## make connection
        my_conn<-dbConnect(RMariaDB::MariaDB(),
                           default.file=db_credentials,
                           group=my_sql_db)

        ## Build table from df
        tryCatch(dbWriteTable(my_conn, value = df, 
                              name = tablename,
                              overwrite =FALSE,
                              row.names = FALSE) ,error= function(e){
                              print("table can not be overwritten and already exists. Please use update table function or change name")})
        
        ## Set primary key to Companytag+Date
        res <- dbSendQuery(my_conn, paste("ALTER TABLE",tablename,"ADD CONSTRAINT websites_pk
                                 PRIMARY KEY (`Id`(40)) ;"))
        ## Disconnect
        dbClearResult(res)
        dbDisconnect(my_conn)
}


### loops through list of stock tags and updates SQL DB 
### Will not update db if any of stock data in new queried df already exists in SQL DB
update_table <- function(tags,tablename){
    for (tag in tags){
        
        ## set error checker
        skip_to_next <- FALSE
        
        ## import df from json api call function 
        df <- json_to_df(tag)
        
        ## grab credentials from credential file
        db_credentials<-"C:\\Users\\justin\\Desktop\\xmedia.cnf"
        my_sql_db<-"xmedia"
        
        ## make connection
        my_conn<-dbConnect(RMariaDB::MariaDB(),
                           default.file=db_credentials,
                           group=my_sql_db) 
        
        # insert df into SQL.  Catches primary key conflicts(duplicate data), prints stock wasnt updated, and moves on in loop
        tryCatch(dbWriteTable(my_conn, value = df, 
                          name = tablename, 
                          overwrite= FALSE,   
                          append = TRUE,                         
                          row.names = FALSE),error= function(e){skip_to_next <<- TRUE})
         if(skip_to_next) { print(paste("Company",tag, "data already exists in DB"))
                            gc()
                            dbDisconnect(my_conn)
                            next }
        gc()
        dbDisconnect(my_conn)
        print(paste("db was updated correctly with: ",tag ))
    }
}

Call Functions and Build Database

  • Create a list of banks to explore
  • Run functions to insert that list into MYSQL database
## [1] "db was updated correctly with:  WFC"
## [1] "db was updated correctly with:  PNC"
## [1] "db was updated correctly with:  BBT"
## [1] "db was updated correctly with:  STI"
## [1] "db was updated correctly with:  KEY"
## [1] "db was updated correctly with:  MTB"
## [1] "db was updated correctly with:  HBAN"
## [1] "db was updated correctly with:  ZION"
## [1] "db was updated correctly with:  CMA"
## [1] "db was updated correctly with:  FITB"

Explore SQL Database

  • Experiment with SQL database
##                         Field   Type Null Key Default Extra
## 1                     Revenue double  YES        <NA>      
## 2               RevenueGrowth double  YES        <NA>      
## 3               CostofRevenue double  YES        <NA>      
## 4                 GrossProfit double  YES        <NA>      
## 5                  RDExpenses double  YES        <NA>      
## 6                  SGAExpense double  YES        <NA>      
## 7           OperatingExpenses double  YES        <NA>      
## 8             OperatingIncome double  YES        <NA>      
## 9             InterestExpense double  YES        <NA>      
## 10          EarningsbeforeTax double  YES        <NA>      
## 11           IncomeTaxExpense double  YES        <NA>      
## 12 NetIncomeNonControllingint double  YES        <NA>      
## 13   NetIncomeDiscontinuedops double  YES        <NA>      
## 14                  NetIncome double  YES        <NA>      
## 15         PreferredDividends double  YES        <NA>      
## 16               NetIncomeCom double  YES        <NA>      
## 17                        EPS double  YES        <NA>      
## 18                 EPSDiluted double  YES        <NA>      
## 19      WeightedAverageShsOut double  YES        <NA>      
## 20   WeightedAverageShsOutDil double  YES        <NA>      
## 21           DividendperShare double  YES        <NA>      
## 22                GrossMargin double  YES        <NA>      
## 23               EBITDAMargin double  YES        <NA>      
## 24                 EBITMargin double  YES        <NA>      
## 25               ProfitMargin double  YES        <NA>      
## 26         FreeCashFlowmargin double  YES        <NA>      
## 27                     EBITDA double  YES        <NA>      
## 28                       EBIT double  YES        <NA>      
## 29         ConsolidatedIncome double  YES        <NA>      
## 30    EarningsBeforeTaxMargin double  YES        <NA>      
## 31            NetProfitMargin double  YES        <NA>      
## 32                Report_Date   date  YES        <NA>      
## 33                    Company   text  YES        <NA>      
## 34                         Id   text   NO PRI    <NA>
## 
## FALSE  TRUE 
## 15105    59
##           used (Mb) gc trigger  (Mb) max used (Mb)
## Ncells 1032373 55.2    2006685 107.2  1548993 82.8
## Vcells 1817225 13.9    8388608  64.0  2995184 22.9

Windows Functions

I look at 10 of the largest publicly traded banks by capsize. Windows functions can be useful to compare value over different periods. The main windows functions used are

  • LAG- allows us to select a value that lags current value
  • DENSERANK- allows us to rank results based upon partition
  • ROWS- allows us to grab rows preceding our current value

In my queries, LAG is used to create a growth column. Quarterly data can suffer from seasonality, therefore, we can use LAG set with 4 to compare the YOY(year over year) quarterly growth. For example, in our data (lag,4) returns Q1 2018 if our current value is Q1 2019. I then use a growth formula to measure growth. Next, I used DENSERANK to rank the results of our LAG query. This function builds a ranking based on growth and can be modified to look at industry level versus company level quarterly performance rankings. The last function builds a three-year rolling average. It’s designed to demonstrate how one would apply a cumulative sum type windows function query.

Build Windows Functions

  • All windows functions are listed below, but they will be called individually in their sections as well. Here, they are listed with some extra inline notes
### Uses Lag to build Year over year quarterly growth for desired columns

yty_growth <- function(statistic,table_name){
    my_conn<-dbConnect(RMariaDB::MariaDB(),
                   default.file=db_credentials,
                   group=my_sql_db) 
       query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",statistic,", LAG(",statistic, ",4) OVER( PARTITION BY Company ORDER BY Report_Date)  Last_Year_",statistic," FROM ",table_name,") SELECT Company,Report_Date,",statistic, ",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,"*-1,3) AS Growth FROM Last_Year;",sep="" )
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
}


### Uses dense rank to rank to first develop a growth percent by any col and then rank those growth rates
### Function inputs are statistic, table name, partition set
### if input partition is set to 1, then function will rank individual Company performance
### If partition is not set to 1, function will return ranks of the aggregate banking industry

yty_ranking <- function(statistic,table_name,partition=0){
    ##Build connection
    my_conn<-dbConnect(RMariaDB::MariaDB(),
                   default.file=db_credentials,
                   group=my_sql_db)
    
    ##Build query without partition
    if (partition==0){
    query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",
                   statistic,", LAG(",statistic, ",4) OVER( PARTITION BY Company ORDER BY Report_Date) Last_Year_",statistic,
                   " FROM ",table_name," ),", statistic,"_table AS (SELECT Company,Report_Date,",statistic,
                   ",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,
                   "*-1,3) AS Growth FROM Last_Year) SELECT *, CASE WHEN Growth IS NOT NULL then DENSE_RANK() OVER ( ORDER BY Growth desc) end) AS ranked_Growth FROM ",statistic,"_table;",sep="")
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
    }
    
    ##Build query with partition
    else {
        query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",statistic,", LAG(",statistic, ",4) OVER( PARTITION BY Company ORDER BY Report_Date) Last_Year_",statistic," FROM ",table_name," ),", statistic,"_table AS (SELECT Company,Report_Date,",statistic,",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,"*-1,3) AS Growth FROM Last_Year) SELECT *, (CASE WHEN Growth IS NOT NULL then DENSE_RANK() OVER ( PARTITION BY Company ORDER BY Growth desc) end) AS ranked_Growth FROM ",statistic,"_table;",sep="")
    
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
    }
}

## aggregate data to yearly data and build out a 3 year moving  average 
## Query does not take any inputs, it's to show how you would create a moving average, similar to creating a cumulative sum value for customers

yearly_moving_avg <- function()
{
    my_conn<-dbConnect(RMariaDB::MariaDB(),
                   default.file=db_credentials,
                   group=my_sql_db)
    query <- paste(
    " WITH yearly_revenue AS (SELECT Company, Year(Report_Date) AS years, sum(revenue) AS revenues FROM financials GROUP BY Company,",
    "Year(Report_Date) ORDER BY Company,Year(Report_Date)) SELECT *, CASE WHEN years not in (2009,2010) then sum(revenues) OVER",
    "(ORDER BY Company, years rows between 3 preceding and current row) else null end AS 'Three_Year_Running_Avg' FROM yearly_revenue ORDER BY Company,years;", sep="" )
    
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
}           

Execute yty_Growth

With Revenue

  • Inputs
    • Statistic (any numeric column)
    • Tablename (any character value)
  • Returns YOY Growth rate for company performance
  • Creates a CTE table Last_Year that uses a lag of 4 to grab the last year quarterly results and builds that column as a new column
  • The function finishes by building a Growth rate based on the newly created last year column versus the current year value.
## [1] "WITH Last_Year AS (SELECT Company,Report_Date,Revenue, LAG(Revenue,4) Over( PARTITION BY Company ORDER BY Report_Date)  Last_Year_Revenue FROM financials) SELECT Company,Report_Date,Revenue,Last_Year_Revenue, ROUND((Last_Year_Revenue-Revenue)/Last_Year_Revenue*-1,3) AS Growth FROM Last_Year;"

With Operating Expenses

  • Function is identical to above function
    • See Execute yty_growth with Revenue section for explanation of function
## [1] "WITH Last_Year  AS (SELECT Company,Report_Date,OperatingExpenses, LAG(OperatingExpenses,4) Over( PARTITION BY Company ORDER BY Report_Date)  Last_Year_OperatingExpenses FROM financials) SELECT Company,Report_Date,OperatingExpenses,Last_Year_OperatingExpenses, ROUND((Last_Year_OperatingExpenses-OperatingExpenses)/Last_Year_OperatingExpenses*-1,3) AS growth FROM Last_Year;"

Execute yty_ranking

PARTIONED BY Industry

  • The query uses a CTE to build a growth rate table(Last_Year). It then uses another CTE to call Last_Year and build a growth_rate column for input statistic. Finally it uses the results FROM the 2nd CTE and runs DENSERANK of the growth_rate with the option to additionally Partition the window on industry level or Company level.
  • If partition is set to 0, it will execute the query as grouped at the industry level.
  • If partion is set to another integer value, it will return company level ranked results
    • The below function returns rank at the industry level
yty_ranking <- function(statistic,table_name,partition=0){
    ##Build connection
    my_conn<-dbConnect(RMariaDB::MariaDB(),
                   default.file=db_credentials,
                   group=my_sql_db)
    
    ##Build query without partition
    if (partition==0){
    query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",
                   statistic,", LAG(",statistic, ",4) Over( PARTITION BY Company ORDER BY Report_Date) Last_Year_",statistic,
                   " FROM ",table_name," ),", statistic,"_table AS (SELECT Company,Report_Date,",statistic,
                   ",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,
                   "*-1,3) AS growth FROM Last_Year) SELECT *, (CASE WHEN growth IS NOT NULL then DENSE_RANK() OVER ( ORDER BY growth desc) end) AS ranked_growth FROM ",statistic,"_table;",sep="")
    print(query)
    
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
    }
    
    ##Build query with partition
    else {
        query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",statistic,", LAG(",statistic, ",4) Over( PARTITION BY Company ORDER BY Report_Date) Last_Year_",statistic," FROM ",table_name," ),", statistic,"_table AS (SELECT Company,Report_Date,",statistic,",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,"*-1,3) AS growth FROM Last_Year) SELECT *, (CASE WHEN growth IS NOT NULL then DENSE_RANK() OVER ( PARTITION BY Company ORDER BY growth desc) end) AS ranked_growth FROM ",statistic,"_table;",sep="")
    print(query)
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
    }
}

## call function
yty_ranking("Revenue","financials")
## [1] "WITH Last_Year  AS (SELECT Company,Report_Date,Revenue, LAG(Revenue,4) Over( PARTITION BY Company ORDER BY Report_Date) Last_Year_Revenue FROM financials ),Revenue_table AS (SELECT Company,Report_Date,Revenue,Last_Year_Revenue, ROUND((Last_Year_Revenue-Revenue)/Last_Year_Revenue*-1,3) AS growth FROM Last_Year) SELECT *, (CASE WHEN growth IS NOT NULL then DENSE_RANK() OVER ( ORDER BY growth desc) end) AS ranked_growth FROM Revenue_table;"

Partioned by Company

same function as above function

  • See previous explanation
  • This function sets partition to 1 and displays industry level quarterly results
yty_ranking <- function(statistic,table_name,partition=0){
    ##Build connection
    my_conn<-dbConnect(RMariaDB::MariaDB(),
                   default.file=db_credentials,
                   group=my_sql_db)
    
    ##Build query without partition
    if (partition==0){
    query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",
                   statistic,", LAG(",statistic, ",4) OVER( PARTITION BY Company ORDER BY Report_Date) Last_Year_",statistic,
                   " FROM ",table_name," ),", statistic,"_table AS (SELECT Company,Report_Date,",statistic,
                   ",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,
                   "*-1,3) AS growth FROM Last_Year) SELECT *, (case when growth is not null then dense_rank() OVER ( ORDER BY growth desc) end) AS ranked_growth FROM ",statistic,"_table;",sep="")
    print(query)
    
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
    }
    
    ##Build query with partition
    else {
        query <- paste("WITH Last_Year  AS (SELECT Company,Report_Date,",statistic,", LAG(",statistic, ",4) OVER( PARTITION BY Company ORDER BY Report_Date) Last_Year_",statistic," FROM ",table_name," ),", statistic,"_table AS (SELECT Company,Report_Date,",statistic,",Last_Year_",statistic,", ROUND((Last_Year_",statistic,"-",statistic,")/Last_Year_",statistic,"*-1,3) AS growth FROM Last_Year) SELECT *, (case when growth is not null then dense_rank() OVER ( PARTITION BY Company ORDER BY growth desc) end) AS ranked_growth FROM ",statistic,"_table;",sep="")
    print(query)
    
    ## Send query request and display result
    res <- dbGetQuery(my_conn,query)
    datatable(res)
    gc()
    dbDisconnect(my_conn)
    return(datatable(res))
    }
}

## call function
yty_ranking("Revenue","financials",partition=1)
## [1] "WITH Last_Year  AS (SELECT Company,Report_Date,Revenue, LAG(Revenue,4) OVER( PARTITION BY Company ORDER BY Report_Date) Last_Year_Revenue FROM financials ),Revenue_table AS (SELECT Company,Report_Date,Revenue,Last_Year_Revenue, ROUND((Last_Year_Revenue-Revenue)/Last_Year_Revenue*-1,3) AS growth FROM Last_Year) SELECT *, (case when growth is not null then dense_rank() OVER ( PARTITION BY Company ORDER BY growth desc) end) AS ranked_growth FROM Revenue_table;"

Execute Yearly Moving Average

  • This function is more for show than utility
  • As the directions suggested attempting a windows function tracking cumsum values for customers, this query mimics that workflow. It creates a three year moving average of the yearly revenue over a window including the 2 years prior.
    • Grabbing the three-year average was random. I could have just created cumsum for revenue, but as that value doesn’t mean much here, I figured I would add some moving parts to the query
    • Another moving part is that the query first uses a CTE to build yearly summed data from our quarterly data
    • The function uses that CTE table with the function ROWS to build the three year moving average. It adds CASEWHEN in order to ensure a NUll result for the years 2009 and 2010(less than 3 years)
      • One dangerous part of the query is if data is incomplete it will return some averages that may not be three year moving averages. In order to return Null for the first 2 values (2009 and 2010) I created a case when to return null for 2009 and 2010. If a company doesn’t start in 2009, say it went public in 2011, then it’s first and second running total would be “wrong”. In our case all data is complete

EDA In R

This section is just an extremely fast overview of capabilities of R to pull in data and create tables and graphs. We are simply exploring the revenue column and some quick visualizations for EDA. Overall little time was invested, but I will gladly show more if asked

Tables

Table of Revenue and Growth of Big Banks

Table of Total Revenue From 2009-2019 of Big Banks

company total_revenue
WFC 795758000000
USB 184094000000
PNC 161516000000
BBT 91973000000
STI 76026762000
FITB 60527000000
KEY 44725000000
MTB 43827760000
HBAN 29365887000
CMA 26165000000
ZION 21116056000

Table of Average Quarterly Revenue From 2009-2019 of Big Banks

company avg_quarterly_revnue
WFC 19893950000
USB 4490097561
PNC 3939414634
BBT 2243243902
STI 1900669050
FITB 1513175000
MTB 1095694000
KEY 1090853659
HBAN 716241146
CMA 638170732
ZION 527901400

Table of Cumulative Sum of Revenue From 2009-2019 of Big Banks

Graphs

We can see that Wells Fargo is way too large and it effects our comparisons. Lets exclude Wells Fargo

While the cumulative sum can give us a picture of overall strength in the market and while it appears some banks broke away around 2011, looking at quarterly revenue could yield more fruitful.

The quarterly revenues give us a much clearer story. Almost all the banks have increased revenue from 2009. We can explore growth rates like those we made in SQL as well. Revenue Growth has always been present in the data, so we can just call the column and see that it matches up correctly off by less than .001 in 384/385 cases (rounding errors) with a custom built R function. I print the R code below so you can see the ease of code required

## 
## FALSE  TRUE 
##     1   365

Then we can graph growth rate

Our graph allows us to see that there were some intense swings in the growth rate of HBAN and KEY My assumption would be they have relatively small cap sizes, or they experienced prolonged damage from the recession. We can explore their cap size relative to other banks. Another thing to explore would be filtering this from 2012 forward to get a better idea of swings minus huge outliers and getting several years removed from recession. We can also explore all negative growth rate quarters and see if there were any patterns. We would likely want to build individual graphs of these banks as well.

  • Get total revenue from 2009 to 2011 and rank it
  • We can see below that HBAN and KEY are in fact small cap banks, or at least were in 2009-2011
company sum_revenue
PNC 42044000000
USB 38519000000
BBT 18535000000
STI 15827762000
FITB 14279000000
KEY 8878000000
MTB 8688000000
CMA 5386000000
HBAN 4532340000
ZION 4147812000

Graphing HBAN and KEY only

Getting only negative growth rates

Animated Cumulative Revenue Race

  • animation tracks cumulative revenue from 2009 through 2018.
  • data is complete so no cartesian join was needed and I think its a cool way of visualizing campaign success.
  • If we want to show a client how they have increased growth in some field versus the industry as a whole, this graph can show them where they started and the success and process of things we implemented to get them to where they are now relative to competitors .