SQL, Python - ZOPA loan dataset

The exercise consist in sqlite basic formulas applications in python. The same exercise was applied in R, on another page, which I encourage you to have a look on it. It is obvious the sql codes are the same but passing Python or R libary is the difference, consquently, functions are also different. See next page for these details. It is fairly easy to learn sql in pyton and r


import sqlite3 
import pandas as pd

conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")

cur = conn.cursor()


#Below we will explore a sample dataset of actual loans originated by Zopa, in SQL. Reading the dataset and setting up the db connection

cur.execute("SELECT * from loan_book limit 5;")

results = cur.fetchall()
print(results)

cur.close()
conn.close()

## [('E', 10000.0, 60, 0.285, '2016-09-13 09:57:07.347000', None, 22000.0, -8536411584091361334, 5448905234911242152, 22000.0), ('C1', 8600.0, 60, 0.1482, '2016-09-13 10:17:41.043000', None, 41000.0, 2586587516215590820, -6961035832149058614, 41000.0), ('A1', 5000.0, 60, 0.0653, '2016-09-14 08:33:32.090000', None, 28000.0, 5812729143149005631, 7002479678934768268, 28000.0), ('D', 5000.0, 48, 0.182, '2016-09-14 16:49:36.867000', None, 25000.0, 2455674728913840239, -6107882007346940517, 25000.0), ('A2', 20000.0, 60, 0.0842, '2016-09-13 14:40:43.777000', None, 20100.0, -5412804541701733109, 8229682404651592959, 20100.0)]

Checking how many tables are in the database and what are the columns in each dataset.


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df1 = pd.read_sql_query("SELECT * from loan_book limit 5;", conn)

print(df1)

##   market   Amount  Term_nMonths  Annual_interest_rate  \
## 0      E  10000.0            60                0.2850   
## 1     C1   8600.0            60                0.1482   
## 2     A1   5000.0            60                0.0653   
## 3      D   5000.0            48                0.1820   
## 4     A2  20000.0            60                0.0842   
## 
##             Date_of_disbursal Date_of_default  Declared_income  \
## 0  2016-09-13 09:57:07.347000            None          22000.0   
## 1  2016-09-13 10:17:41.043000            None          41000.0   
## 2  2016-09-14 08:33:32.090000            None          28000.0   
## 3  2016-09-14 16:49:36.867000            None          25000.0   
## 4  2016-09-13 14:40:43.777000            None          20100.0   
## 
##                loan_id            member_id  Declared_incom  
## 0 -8536411584091361334  5448905234911242152         22000.0  
## 1  2586587516215590820 -6961035832149058614         41000.0  
## 2  5812729143149005631  7002479678934768268         28000.0  
## 3  2455674728913840239 -6107882007346940517         25000.0  
## 4 -5412804541701733109  8229682404651592959         20100.0

Below is an example of how to query the SQLite database. SQLite might have different syntax from the SQL flavour but for the simple tasks, this is no big difference. The first table “loan_book” contains a sample of actual loans disbursed through Zopa, but limited to 5 rows.


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df2 = pd.read_sql_query("select * from loan_book limit 5", conn)
print(df2)

##   market   Amount  Term_nMonths  Annual_interest_rate  \
## 0      E  10000.0            60                0.2850   
## 1     C1   8600.0            60                0.1482   
## 2     A1   5000.0            60                0.0653   
## 3      D   5000.0            48                0.1820   
## 4     A2  20000.0            60                0.0842   
## 
##             Date_of_disbursal Date_of_default  Declared_income  \
## 0  2016-09-13 09:57:07.347000            None          22000.0   
## 1  2016-09-13 10:17:41.043000            None          41000.0   
## 2  2016-09-14 08:33:32.090000            None          28000.0   
## 3  2016-09-14 16:49:36.867000            None          25000.0   
## 4  2016-09-13 14:40:43.777000            None          20100.0   
## 
##                loan_id            member_id  Declared_incom  
## 0 -8536411584091361334  5448905234911242152         22000.0  
## 1  2586587516215590820 -6961035832149058614         41000.0  
## 2  5812729143149005631  7002479678934768268         28000.0  
## 3  2455674728913840239 -6107882007346940517         25000.0  
## 4 -5412804541701733109  8229682404651592959         20100.0

The second table “default_rate_estimate” contains a lookup table of the expected default rate for various type of loans, limited to 5 rows.


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")

df3 = pd.read_sql_query("select * from default_rate_estimate limit 5;", conn)

print(df3)

#What are the different "market" we have?
df3 = pd.read_sql_query("SELECT DISTINCT market FROM loan_book", conn)

print(df3)

##   market  Term_nMonths  Annual_default_rate_estimate
## 0     A*            12                       0.00344
## 1     A*            24                       0.00345
## 2     A*            36                       0.00303
## 3     A*            48                       0.00356
## 4     A*            60                       0.00446
##   market
## 0      E
## 1     C1
## 2     A1
## 3      D
## 4     A2
## 5      B
## 6     A*


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")

#How many loans we have in each market?
df5 = pd.read_sql_query("""
SELECT market, COUNT(*) as AMOUNT 
FROM loan_book 
GROUP BY market;""", conn) 

print(df5)

##   market  AMOUNT
## 0     A*   21877
## 1     A1   13097
## 2     A2   13657
## 3      B   14527
## 4     C1   16849
## 5      D   11841
## 6      E    7776

Find top 3 calendar months with highest average values of estimated annual default rate


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df5 = pd.read_sql_query(
"""
SELECT loan_book.Date_of_disbursal, 
        strftime('%Y-%m', Date_of_disbursal) as 'Year&Month', 
        avg(Annual_default_rate_estimate) as 'Avg - Annual Default_rate_estimate'
FROM default_rate_estimate 
INNER JOIN loan_book ON loan_book.market = default_rate_estimate.market 
GROUP BY 'Year&Month' 
ORDER BY 'Avg - Annual Default_rate_estimate' DESC
LIMIT 3;""", conn)

print(df5)

##             Date_of_disbursal Year&Month  Avg - Annual Default_rate_estimate
## 0  2016-12-31 21:26:41.137000    2016-12                            0.049669

In each market, what is the average of estimated annual default rate, and what is the actual default rate in the first year after disbursal?¶


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")

df6 = pd.read_sql_query("""
SELECT loan_book.Date_of_disbursal,
                                  strftime('%Y-%m', Date_of_disbursal) as 'Year&Month', 
                                  avg(Annual_default_rate_estimate) as 'Avg - AnnualDefault_rate_estimate'
FROM default_rate_estimate
INNER JOIN loan_book ON loan_book.market = default_rate_estimate.market
GROUP BY 'Year&Month'
ORDER BY 'Avg - Annual Default_rate_estimate' DESC LIMIT 3;""", conn)

print(df6)

##             Date_of_disbursal Year&Month  Avg - AnnualDefault_rate_estimate
## 0  2016-12-31 21:26:41.137000    2016-12                           0.049669

In each market, what is the average of estimated annual default rate, and what is the actual default rate in the first year after disbursal?


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")

df7 = pd.read_sql_query("""
SELECT loan_book.market,
                        SUM(Date_of_default) as 'Actual Default Rate 1st Year from Disbursal',
                        AVG(Annual_default_rate_estimate) as 'Avg - Annual Default_rate_estimate'
FROM default_rate_estimate
                        INNER JOIN loan_book ON loan_book.market = default_rate_estimate.market
WHERE strftime('%Y', Date_of_default)= '2016'
GROUP BY loan_book.market
ORDER BY 'Actual Default Rate 1st Year from Disbursal' DESC;""", conn)

print(df7)

##   market  Actual Default Rate 1st Year from Disbursal  \
## 0     A*                                      40320.0   
## 1     A1                                     171360.0   
## 2     A2                                     252000.0   
## 3      B                                     695520.0   
## 4     C1                                    1451520.0   
## 5      D                                    1653120.0   
## 6      E                                    1058400.0   
## 
##    Avg - Annual Default_rate_estimate  
## 0                            0.003588  
## 1                            0.010130  
## 2                            0.018248  
## 3                            0.035702  
## 4                            0.064888  
## 5                            0.134922  
## 6                            0.164392

How many borrowers had more than one loans from us?


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df8 = pd.read_sql_query("""
SELECT  COUNT(member_id) as 'Borrowers more than one loans'
FROM loan_book
WHERE member_id >1;""", conn)

print(df8)

##    Borrowers more than one loans
## 0                          49886

How many of these re-borrowers got a new loan with lower interest rate than their existing/old ones?


import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")

df9 = pd.read_sql_query(""" 
SELECT COUNT(OldLoan.member_id) as 'Total of borrowers with a new lower interest loan'
FROM loan_book AS OldLoan, loan_book AS NewLoan
WHERE NEWLoan.member_id = OldLoan.member_id
                            AND NewLoan.loan_id > OldLoan.loan_id
                            AND NewLoan.Annual_interest_rate < OldLoan.Annual_interest_Rate;""", conn)

print(df9)

##    Total of borrowers with a new lower interest loan
## 0                                               1941

SQL, Python - ZOPA loan dataset

Gabriel Burcea

02/04/2018