The exercise consist in sqlite basic formulas applications in python. The same exercise was applied in R, on another page, which I encourage you to have a look on it. It is obvious the sql codes are the same but passing Python or R libary is the difference, consquently, functions are also different. See next page for these details. It is fairly easy to learn sql in pyton and r
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
cur = conn.cursor()
#Below we will explore a sample dataset of actual loans originated by Zopa, in SQL. Reading the dataset and setting up the db connection
cur.execute("SELECT * from loan_book limit 5;")
results = cur.fetchall()
print(results)
cur.close()
conn.close()
## [('E', 10000.0, 60, 0.285, '2016-09-13 09:57:07.347000', None, 22000.0, -8536411584091361334, 5448905234911242152, 22000.0), ('C1', 8600.0, 60, 0.1482, '2016-09-13 10:17:41.043000', None, 41000.0, 2586587516215590820, -6961035832149058614, 41000.0), ('A1', 5000.0, 60, 0.0653, '2016-09-14 08:33:32.090000', None, 28000.0, 5812729143149005631, 7002479678934768268, 28000.0), ('D', 5000.0, 48, 0.182, '2016-09-14 16:49:36.867000', None, 25000.0, 2455674728913840239, -6107882007346940517, 25000.0), ('A2', 20000.0, 60, 0.0842, '2016-09-13 14:40:43.777000', None, 20100.0, -5412804541701733109, 8229682404651592959, 20100.0)]
Checking how many tables are in the database and what are the columns in each dataset.
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df1 = pd.read_sql_query("SELECT * from loan_book limit 5;", conn)
print(df1)
## market Amount Term_nMonths Annual_interest_rate \
## 0 E 10000.0 60 0.2850
## 1 C1 8600.0 60 0.1482
## 2 A1 5000.0 60 0.0653
## 3 D 5000.0 48 0.1820
## 4 A2 20000.0 60 0.0842
##
## Date_of_disbursal Date_of_default Declared_income \
## 0 2016-09-13 09:57:07.347000 None 22000.0
## 1 2016-09-13 10:17:41.043000 None 41000.0
## 2 2016-09-14 08:33:32.090000 None 28000.0
## 3 2016-09-14 16:49:36.867000 None 25000.0
## 4 2016-09-13 14:40:43.777000 None 20100.0
##
## loan_id member_id Declared_incom
## 0 -8536411584091361334 5448905234911242152 22000.0
## 1 2586587516215590820 -6961035832149058614 41000.0
## 2 5812729143149005631 7002479678934768268 28000.0
## 3 2455674728913840239 -6107882007346940517 25000.0
## 4 -5412804541701733109 8229682404651592959 20100.0
Below is an example of how to query the SQLite database. SQLite might have different syntax from the SQL flavour but for the simple tasks, this is no big difference. The first table “loan_book” contains a sample of actual loans disbursed through Zopa, but limited to 5 rows.
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df2 = pd.read_sql_query("select * from loan_book limit 5", conn)
print(df2)
## market Amount Term_nMonths Annual_interest_rate \
## 0 E 10000.0 60 0.2850
## 1 C1 8600.0 60 0.1482
## 2 A1 5000.0 60 0.0653
## 3 D 5000.0 48 0.1820
## 4 A2 20000.0 60 0.0842
##
## Date_of_disbursal Date_of_default Declared_income \
## 0 2016-09-13 09:57:07.347000 None 22000.0
## 1 2016-09-13 10:17:41.043000 None 41000.0
## 2 2016-09-14 08:33:32.090000 None 28000.0
## 3 2016-09-14 16:49:36.867000 None 25000.0
## 4 2016-09-13 14:40:43.777000 None 20100.0
##
## loan_id member_id Declared_incom
## 0 -8536411584091361334 5448905234911242152 22000.0
## 1 2586587516215590820 -6961035832149058614 41000.0
## 2 5812729143149005631 7002479678934768268 28000.0
## 3 2455674728913840239 -6107882007346940517 25000.0
## 4 -5412804541701733109 8229682404651592959 20100.0
The second table “default_rate_estimate” contains a lookup table of the expected default rate for various type of loans, limited to 5 rows.
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df3 = pd.read_sql_query("select * from default_rate_estimate limit 5;", conn)
print(df3)
#What are the different "market" we have?
df3 = pd.read_sql_query("SELECT DISTINCT market FROM loan_book", conn)
print(df3)
## market Term_nMonths Annual_default_rate_estimate
## 0 A* 12 0.00344
## 1 A* 24 0.00345
## 2 A* 36 0.00303
## 3 A* 48 0.00356
## 4 A* 60 0.00446
## market
## 0 E
## 1 C1
## 2 A1
## 3 D
## 4 A2
## 5 B
## 6 A*
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
#How many loans we have in each market?
df5 = pd.read_sql_query("""
SELECT market, COUNT(*) as AMOUNT
FROM loan_book
GROUP BY market;""", conn)
print(df5)
## market AMOUNT
## 0 A* 21877
## 1 A1 13097
## 2 A2 13657
## 3 B 14527
## 4 C1 16849
## 5 D 11841
## 6 E 7776
Find top 3 calendar months with highest average values of estimated annual default rate
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df5 = pd.read_sql_query(
"""
SELECT loan_book.Date_of_disbursal,
strftime('%Y-%m', Date_of_disbursal) as 'Year&Month',
avg(Annual_default_rate_estimate) as 'Avg - Annual Default_rate_estimate'
FROM default_rate_estimate
INNER JOIN loan_book ON loan_book.market = default_rate_estimate.market
GROUP BY 'Year&Month'
ORDER BY 'Avg - Annual Default_rate_estimate' DESC
LIMIT 3;""", conn)
print(df5)
## Date_of_disbursal Year&Month Avg - Annual Default_rate_estimate
## 0 2016-12-31 21:26:41.137000 2016-12 0.049669
In each market, what is the average of estimated annual default rate, and what is the actual default rate in the first year after disbursal?¶
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df6 = pd.read_sql_query("""
SELECT loan_book.Date_of_disbursal,
strftime('%Y-%m', Date_of_disbursal) as 'Year&Month',
avg(Annual_default_rate_estimate) as 'Avg - AnnualDefault_rate_estimate'
FROM default_rate_estimate
INNER JOIN loan_book ON loan_book.market = default_rate_estimate.market
GROUP BY 'Year&Month'
ORDER BY 'Avg - Annual Default_rate_estimate' DESC LIMIT 3;""", conn)
print(df6)
## Date_of_disbursal Year&Month Avg - AnnualDefault_rate_estimate
## 0 2016-12-31 21:26:41.137000 2016-12 0.049669
In each market, what is the average of estimated annual default rate, and what is the actual default rate in the first year after disbursal?
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df7 = pd.read_sql_query("""
SELECT loan_book.market,
SUM(Date_of_default) as 'Actual Default Rate 1st Year from Disbursal',
AVG(Annual_default_rate_estimate) as 'Avg - Annual Default_rate_estimate'
FROM default_rate_estimate
INNER JOIN loan_book ON loan_book.market = default_rate_estimate.market
WHERE strftime('%Y', Date_of_default)= '2016'
GROUP BY loan_book.market
ORDER BY 'Actual Default Rate 1st Year from Disbursal' DESC;""", conn)
print(df7)
## market Actual Default Rate 1st Year from Disbursal \
## 0 A* 40320.0
## 1 A1 171360.0
## 2 A2 252000.0
## 3 B 695520.0
## 4 C1 1451520.0
## 5 D 1653120.0
## 6 E 1058400.0
##
## Avg - Annual Default_rate_estimate
## 0 0.003588
## 1 0.010130
## 2 0.018248
## 3 0.035702
## 4 0.064888
## 5 0.134922
## 6 0.164392
How many borrowers had more than one loans from us?
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df8 = pd.read_sql_query("""
SELECT COUNT(member_id) as 'Borrowers more than one loans'
FROM loan_book
WHERE member_id >1;""", conn)
print(df8)
## Borrowers more than one loans
## 0 49886
How many of these re-borrowers got a new loan with lower interest rate than their existing/old ones?
import sqlite3
import pandas as pd
conn = sqlite3.connect("./Desktop/zopa_data_for_test.db")
df9 = pd.read_sql_query("""
SELECT COUNT(OldLoan.member_id) as 'Total of borrowers with a new lower interest loan'
FROM loan_book AS OldLoan, loan_book AS NewLoan
WHERE NEWLoan.member_id = OldLoan.member_id
AND NewLoan.loan_id > OldLoan.loan_id
AND NewLoan.Annual_interest_rate < OldLoan.Annual_interest_Rate;""", conn)
print(df9)
## Total of borrowers with a new lower interest loan
## 0 1941