Case Study 1

Describe the dataset and any issues with it

There are a total of 10,000 observations and 55 features, including both numeric and categorical variables, associated with the dataset. The summary function below gives a general description of each variable.

There are some NAs for some variables. For example, there are 817 NAs for emp_length (number of years in the job). These NAs appear when the emp_title is blank. This is reasonable since some applicants are not employed or have already retired. There are 8505 NAs for annual_income_joint and debt_to_income_joint, indicating that these people are not joint applicants. There are 5658 NAs for months_since_last_delinq and 7715 NAs for months_since_90d_late. Since the number of NAs exceeds 50% of the total observations, we do not consider these two columns in further study.

In addition, columns current_accounts_delinq, num_accounts_120d_past_due, and num_accounts_30d_past_due have nearly all values equaling 0, so we do not consider these variables in further study.

## Read the Data
loan <- read.csv("loans_full_schema.csv")
head(loan)
##                 emp_title emp_length state homeownership annual_income
## 1 global config engineer           3    NJ      MORTGAGE         90000
## 2  warehouse office clerk         10    HI          RENT         40000
## 3                assembly          3    WI          RENT         40000
## 4        customer service          1    PA          RENT         30000
## 5    security supervisor          10    CA          RENT         35000
## 6                                 NA    KY           OWN         34000
##   verified_income debt_to_income annual_income_joint verification_income_joint
## 1        Verified          18.01                  NA                          
## 2    Not Verified           5.04                  NA                          
## 3 Source Verified          21.15                  NA                          
## 4    Not Verified          10.16                  NA                          
## 5        Verified          57.96               57000                  Verified
## 6    Not Verified           6.46                  NA                          
##   debt_to_income_joint delinq_2y months_since_last_delinq earliest_credit_line
## 1                   NA         0                       38                 2001
## 2                   NA         0                       NA                 1996
## 3                   NA         0                       28                 2006
## 4                   NA         0                       NA                 2007
## 5                37.66         0                       NA                 2008
## 6                   NA         1                        3                 1990
##   inquiries_last_12m total_credit_lines open_credit_lines total_credit_limit
## 1                  6                 28                10              70795
## 2                  1                 30                14              28800
## 3                  4                 31                10              24193
## 4                  0                  4                 4              25400
## 5                  7                 22                16              69839
## 6                  6                 32                12              42100
##   total_credit_utilized num_collections_last_12m num_historical_failed_to_pay
## 1                 38767                        0                            0
## 2                  4321                        0                            1
## 3                 16000                        0                            0
## 4                  4997                        0                            1
## 5                 52722                        0                            0
## 6                  3898                        0                            0
##   months_since_90d_late current_accounts_delinq total_collection_amount_ever
## 1                    38                       0                         1250
## 2                    NA                       0                            0
## 3                    28                       0                          432
## 4                    NA                       0                            0
## 5                    NA                       0                            0
## 6                    60                       0                            0
##   current_installment_accounts accounts_opened_24m
## 1                            2                   5
## 2                            0                  11
## 3                            1                  13
## 4                            1                   1
## 5                            1                   6
## 6                            0                   2
##   months_since_last_credit_inquiry num_satisfactory_accounts
## 1                                5                        10
## 2                                8                        14
## 3                                7                        10
## 4                               15                         4
## 5                                4                        16
## 6                                5                        12
##   num_accounts_120d_past_due num_accounts_30d_past_due
## 1                          0                         0
## 2                          0                         0
## 3                          0                         0
## 4                          0                         0
## 5                          0                         0
## 6                          0                         0
##   num_active_debit_accounts total_debit_limit num_total_cc_accounts
## 1                         2             11100                    14
## 2                         3             16500                    24
## 3                         3              4300                    14
## 4                         2             19400                     3
## 5                        10             32700                    20
## 6                         1             27200                    27
##   num_open_cc_accounts num_cc_carrying_balance num_mort_accounts
## 1                    8                       6                 1
## 2                   14                       4                 0
## 3                    8                       6                 0
## 4                    3                       2                 0
## 5                   15                      13                 0
## 6                   12                       5                 3
##   account_never_delinq_percent tax_liens public_record_bankrupt
## 1                         92.9         0                      0
## 2                        100.0         0                      1
## 3                         93.5         0                      0
## 4                        100.0         1                      0
## 5                        100.0         0                      0
## 6                         78.1         0                      0
##         loan_purpose application_type loan_amount term interest_rate
## 1             moving       individual       28000   60         14.07
## 2 debt_consolidation       individual        5000   36         12.61
## 3              other       individual        2000   36         17.09
## 4 debt_consolidation       individual       21600   36          6.72
## 5        credit_card            joint       23000   36         14.07
## 6              other       individual        5000   36          6.72
##   installment grade sub_grade issue_month loan_status initial_listing_status
## 1      652.53     C        C3    Mar-2018     Current                  whole
## 2      167.54     C        C1    Feb-2018     Current                  whole
## 3       71.40     D        D1    Feb-2018     Current             fractional
## 4      664.19     A        A3    Jan-2018     Current                  whole
## 5      786.87     C        C3    Mar-2018     Current                  whole
## 6      153.75     A        A3    Jan-2018     Current                  whole
##   disbursement_method  balance paid_total paid_principal paid_interest
## 1                Cash 27015.86    1999.33         984.14       1015.19
## 2                Cash  4651.37     499.12         348.63        150.49
## 3                Cash  1824.63     281.80         175.37        106.43
## 4                Cash 18853.26    3312.89        2746.74        566.15
## 5                Cash 21430.15    2324.65        1569.85        754.80
## 6                Cash  4256.71     873.13         743.29        129.84
##   paid_late_fees
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0
dim(loan)
## [1] 10000    55
summary(loan)
##   emp_title           emp_length       state           homeownership     
##  Length:10000       Min.   : 0.00   Length:10000       Length:10000      
##  Class :character   1st Qu.: 2.00   Class :character   Class :character  
##  Mode  :character   Median : 6.00   Mode  :character   Mode  :character  
##                     Mean   : 5.93                                        
##                     3rd Qu.:10.00                                        
##                     Max.   :10.00                                        
##                     NA's   :817                                          
##  annual_income     verified_income    debt_to_income   annual_income_joint
##  Min.   :      0   Length:10000       Min.   :  0.00   Min.   :  19200    
##  1st Qu.:  45000   Class :character   1st Qu.: 11.06   1st Qu.:  86834    
##  Median :  65000   Mode  :character   Median : 17.57   Median : 113000    
##  Mean   :  79222                      Mean   : 19.31   Mean   : 127915    
##  3rd Qu.:  95000                      3rd Qu.: 25.00   3rd Qu.: 151546    
##  Max.   :2300000                      Max.   :469.09   Max.   :1100000    
##                                       NA's   :24       NA's   :8505       
##  verification_income_joint debt_to_income_joint   delinq_2y     
##  Length:10000              Min.   : 0.32        Min.   : 0.000  
##  Class :character          1st Qu.:14.16        1st Qu.: 0.000  
##  Mode  :character          Median :19.72        Median : 0.000  
##                            Mean   :19.98        Mean   : 0.216  
##                            3rd Qu.:25.50        3rd Qu.: 0.000  
##                            Max.   :39.98        Max.   :13.000  
##                            NA's   :8505                         
##  months_since_last_delinq earliest_credit_line inquiries_last_12m
##  Min.   :  1.00           Min.   :1963         Min.   : 0.000    
##  1st Qu.: 19.00           1st Qu.:1997         1st Qu.: 0.000    
##  Median : 34.00           Median :2003         Median : 1.000    
##  Mean   : 36.76           Mean   :2001         Mean   : 1.958    
##  3rd Qu.: 53.00           3rd Qu.:2006         3rd Qu.: 3.000    
##  Max.   :118.00           Max.   :2015         Max.   :29.000    
##  NA's   :5658                                                    
##  total_credit_lines open_credit_lines total_credit_limit total_credit_utilized
##  Min.   : 2.00      Min.   : 0.0      Min.   :      0    Min.   :     0       
##  1st Qu.:14.00      1st Qu.: 7.0      1st Qu.:  51594    1st Qu.: 19186       
##  Median :21.00      Median :10.0      Median : 114667    Median : 36927       
##  Mean   :22.68      Mean   :11.4      Mean   : 183606    Mean   : 51049       
##  3rd Qu.:29.00      3rd Qu.:14.0      3rd Qu.: 267550    3rd Qu.: 65421       
##  Max.   :87.00      Max.   :51.0      Max.   :3386034    Max.   :942456       
##                                                                               
##  num_collections_last_12m num_historical_failed_to_pay months_since_90d_late
##  Min.   :0.0000           Min.   : 0.0000              Min.   :  2.00       
##  1st Qu.:0.0000           1st Qu.: 0.0000              1st Qu.: 29.00       
##  Median :0.0000           Median : 0.0000              Median : 47.00       
##  Mean   :0.0138           Mean   : 0.1671              Mean   : 46.11       
##  3rd Qu.:0.0000           3rd Qu.: 0.0000              3rd Qu.: 63.00       
##  Max.   :3.0000           Max.   :52.0000              Max.   :128.00       
##                                                        NA's   :7715         
##  current_accounts_delinq total_collection_amount_ever
##  Min.   :0e+00           Min.   :     0.0            
##  1st Qu.:0e+00           1st Qu.:     0.0            
##  Median :0e+00           Median :     0.0            
##  Mean   :1e-04           Mean   :   184.3            
##  3rd Qu.:0e+00           3rd Qu.:     0.0            
##  Max.   :1e+00           Max.   :199308.0            
##                                                      
##  current_installment_accounts accounts_opened_24m
##  Min.   : 0.000               Min.   : 0.000     
##  1st Qu.: 1.000               1st Qu.: 2.000     
##  Median : 2.000               Median : 4.000     
##  Mean   : 2.664               Mean   : 4.376     
##  3rd Qu.: 3.000               3rd Qu.: 6.000     
##  Max.   :35.000               Max.   :29.000     
##                                                  
##  months_since_last_credit_inquiry num_satisfactory_accounts
##  Min.   : 0.000                   Min.   : 0.00            
##  1st Qu.: 2.000                   1st Qu.: 7.00            
##  Median : 6.000                   Median :10.00            
##  Mean   : 7.341                   Mean   :11.38            
##  3rd Qu.:11.000                   3rd Qu.:14.00            
##  Max.   :24.000                   Max.   :51.00            
##  NA's   :1271                                              
##  num_accounts_120d_past_due num_accounts_30d_past_due num_active_debit_accounts
##  Min.   :0                  Min.   :0e+00             Min.   : 0.000           
##  1st Qu.:0                  1st Qu.:0e+00             1st Qu.: 2.000           
##  Median :0                  Median :0e+00             Median : 3.000           
##  Mean   :0                  Mean   :1e-04             Mean   : 3.595           
##  3rd Qu.:0                  3rd Qu.:0e+00             3rd Qu.: 5.000           
##  Max.   :0                  Max.   :1e+00             Max.   :32.000           
##  NA's   :318                                                                   
##  total_debit_limit num_total_cc_accounts num_open_cc_accounts
##  Min.   :     0    Min.   : 2.00         Min.   : 0.000      
##  1st Qu.: 10000    1st Qu.: 7.00         1st Qu.: 5.000      
##  Median : 19500    Median :11.00         Median : 7.000      
##  Mean   : 27357    Mean   :13.03         Mean   : 8.095      
##  3rd Qu.: 36100    3rd Qu.:17.00         3rd Qu.:10.000      
##  Max.   :386700    Max.   :66.00         Max.   :46.000      
##                                                              
##  num_cc_carrying_balance num_mort_accounts account_never_delinq_percent
##  Min.   : 0.000          Min.   : 0.000    Min.   : 14.30              
##  1st Qu.: 3.000          1st Qu.: 0.000    1st Qu.: 92.60              
##  Median : 5.000          Median : 1.000    Median :100.00              
##  Mean   : 5.231          Mean   : 1.383    Mean   : 94.65              
##  3rd Qu.: 7.000          3rd Qu.: 2.000    3rd Qu.:100.00              
##  Max.   :43.000          Max.   :14.000    Max.   :100.00              
##                                                                        
##    tax_liens       public_record_bankrupt loan_purpose       application_type  
##  Min.   : 0.0000   Min.   :0.0000         Length:10000       Length:10000      
##  1st Qu.: 0.0000   1st Qu.:0.0000         Class :character   Class :character  
##  Median : 0.0000   Median :0.0000         Mode  :character   Mode  :character  
##  Mean   : 0.0433   Mean   :0.1238                                              
##  3rd Qu.: 0.0000   3rd Qu.:0.0000                                              
##  Max.   :52.0000   Max.   :3.0000                                              
##                                                                                
##   loan_amount         term       interest_rate    installment     
##  Min.   : 1000   Min.   :36.00   Min.   : 5.31   Min.   :  30.75  
##  1st Qu.: 8000   1st Qu.:36.00   1st Qu.: 9.43   1st Qu.: 256.04  
##  Median :14500   Median :36.00   Median :11.98   Median : 398.42  
##  Mean   :16362   Mean   :43.27   Mean   :12.43   Mean   : 476.21  
##  3rd Qu.:24000   3rd Qu.:60.00   3rd Qu.:15.05   3rd Qu.: 644.69  
##  Max.   :40000   Max.   :60.00   Max.   :30.94   Max.   :1566.59  
##                                                                   
##     grade            sub_grade         issue_month        loan_status       
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  initial_listing_status disbursement_method    balance        paid_total     
##  Length:10000           Length:10000        Min.   :    0   Min.   :    0.0  
##  Class :character       Class :character    1st Qu.: 6679   1st Qu.:  928.7  
##  Mode  :character       Mode  :character    Median :12380   Median : 1563.3  
##                                             Mean   :14459   Mean   : 2494.2  
##                                             3rd Qu.:20690   3rd Qu.: 2616.0  
##                                             Max.   :40000   Max.   :41630.4  
##                                                                              
##  paid_principal    paid_interest    paid_late_fees   
##  Min.   :    0.0   Min.   :   0.0   Min.   : 0.0000  
##  1st Qu.:  587.1   1st Qu.: 221.8   1st Qu.: 0.0000  
##  Median :  985.0   Median : 446.1   Median : 0.0000  
##  Mean   : 1894.5   Mean   : 599.7   Mean   : 0.1195  
##  3rd Qu.: 1694.6   3rd Qu.: 825.4   3rd Qu.: 0.0000  
##  Max.   :40000.0   Max.   :4216.4   Max.   :52.9800  
## 
## number of NAs in each column
colSums(is.na(loan))
##                        emp_title                       emp_length 
##                                0                              817 
##                            state                    homeownership 
##                                0                                0 
##                    annual_income                  verified_income 
##                                0                                0 
##                   debt_to_income              annual_income_joint 
##                               24                             8505 
##        verification_income_joint             debt_to_income_joint 
##                                0                             8505 
##                        delinq_2y         months_since_last_delinq 
##                                0                             5658 
##             earliest_credit_line               inquiries_last_12m 
##                                0                                0 
##               total_credit_lines                open_credit_lines 
##                                0                                0 
##               total_credit_limit            total_credit_utilized 
##                                0                                0 
##         num_collections_last_12m     num_historical_failed_to_pay 
##                                0                                0 
##            months_since_90d_late          current_accounts_delinq 
##                             7715                                0 
##     total_collection_amount_ever     current_installment_accounts 
##                                0                                0 
##              accounts_opened_24m months_since_last_credit_inquiry 
##                                0                             1271 
##        num_satisfactory_accounts       num_accounts_120d_past_due 
##                                0                              318 
##        num_accounts_30d_past_due        num_active_debit_accounts 
##                                0                                0 
##                total_debit_limit            num_total_cc_accounts 
##                                0                                0 
##             num_open_cc_accounts          num_cc_carrying_balance 
##                                0                                0 
##                num_mort_accounts     account_never_delinq_percent 
##                                0                                0 
##                        tax_liens           public_record_bankrupt 
##                                0                                0 
##                     loan_purpose                 application_type 
##                                0                                0 
##                      loan_amount                             term 
##                                0                                0 
##                    interest_rate                      installment 
##                                0                                0 
##                            grade                        sub_grade 
##                                0                                0 
##                      issue_month                      loan_status 
##                                0                                0 
##           initial_listing_status              disbursement_method 
##                                0                                0 
##                          balance                       paid_total 
##                                0                                0 
##                   paid_principal                    paid_interest 
##                                0                                0 
##                   paid_late_fees 
##                                0
## Check number of zeros of each column
colSums(loan == 0)
##                        emp_title                       emp_length 
##                                0                               NA 
##                            state                    homeownership 
##                                0                                0 
##                    annual_income                  verified_income 
##                               23                                0 
##                   debt_to_income              annual_income_joint 
##                               NA                               NA 
##        verification_income_joint             debt_to_income_joint 
##                                0                               NA 
##                        delinq_2y         months_since_last_delinq 
##                             8576                               NA 
##             earliest_credit_line               inquiries_last_12m 
##                                0                             3060 
##               total_credit_lines                open_credit_lines 
##                                0                                2 
##               total_credit_limit            total_credit_utilized 
##                                2                               28 
##         num_collections_last_12m     num_historical_failed_to_pay 
##                             9873                             8586 
##            months_since_90d_late          current_accounts_delinq 
##                               NA                             9999 
##     total_collection_amount_ever     current_installment_accounts 
##                             8635                             1414 
##              accounts_opened_24m months_since_last_credit_inquiry 
##                              527                               NA 
##        num_satisfactory_accounts       num_accounts_120d_past_due 
##                                2                               NA 
##        num_accounts_30d_past_due        num_active_debit_accounts 
##                             9999                              333 
##                total_debit_limit            num_total_cc_accounts 
##                              135                                0 
##             num_open_cc_accounts          num_cc_carrying_balance 
##                                7                              125 
##                num_mort_accounts     account_never_delinq_percent 
##                             4484                                0 
##                        tax_liens           public_record_bankrupt 
##                             9754                             8785 
##                     loan_purpose                 application_type 
##                                0                                0 
##                      loan_amount                             term 
##                                0                                0 
##                    interest_rate                      installment 
##                                0                                0 
##                            grade                        sub_grade 
##                                0                                0 
##                      issue_month                      loan_status 
##                                0                                0 
##           initial_listing_status              disbursement_method 
##                                0                                0 
##                          balance                       paid_total 
##                              455                               15 
##                   paid_principal                    paid_interest 
##                               15                               18 
##                   paid_late_fees 
##                             9948

Visualizations

I generated 5 plots: * The Most 20 Job Titles Applying for Loans * Correlation Heatmap * Boxplots for Interest Rate Comparison * Annual Income Distribution by Different Loan Purposes * Number of Delinquencies over the Last 2 Years by Homeownerships

Before performing visualizations, I created my own theme called theme1.

library(ggplot2)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
theme1 <- theme_bw() +
  theme(axis.text = element_text(size = 8, colour = "#6b3447"),
        axis.title = element_text(size = 10, colour = "#2f2f63"),
        legend.title = element_text(size = 8, colour = "#2f2f63"),
        legend.text = element_text(size = 8, colour = "#6b3447"),
        title = element_text(size = 9, colour = "#2f2f63"),
        axis.ticks = element_line(colour = "#6b3447"),
        plot.caption = element_text(size = 8, colour = "#2f2f63"),
        plot.subtitle = element_text(size = 10, colour = "#2f2f63"))

Plot 1: The Most 20 Job Titles Applying for Loans

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df_5 <- loan %>%
  filter(!any(is.na(emp_title))) %>% 
  group_by(emp_title) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(21)
## Selecting by count
library(forcats)
ggplot(df_5[-1, ], aes(x = count, y = fct_reorder(emp_title, count))) +
  theme1 +
  geom_col(col = "pink") +
  xlab("Count") +
  ylab("Job Title") +
  ggtitle("The Most 20 Job Titles Applying for Loans") +
  theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))

From the plot we can see that the most 20 job titles applying for loans are MANAGER, OWNER, TEACHER, DRIVER, SALES, etc.

Plot 2: Correlation Heatmap

library(ggcorrplot)
cor_data <- loan[, c('annual_income', 'earliest_credit_line', 'inquiries_last_12m', 'total_credit_lines', 'open_credit_lines', 'total_credit_limit', 'total_credit_utilized', 'num_historical_failed_to_pay', 'accounts_opened_24m', 'current_installment_accounts', 'num_active_debit_accounts', 'total_debit_limit', 'num_total_cc_accounts', 'num_open_cc_accounts', 'public_record_bankrupt', 'loan_amount', 'term', 'installment', 'balance', 'paid_total', 'paid_interest', 'interest_rate')]

ggcorrplot(cor(cor_data), pch = 1, pch.cex =1, tl.cex = 8) +
  theme1 +
  xlab("Variable List 1") +
  ylab("Variable List 2") +
  ggtitle("Correlation Heatmap") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The correlation heatmap allows us the see how numeric variables are correlated and also gives us some idea on building models to predict interest rate. From the heatmap, we get the following results:

  • Installment and loan amount are highly positively correlated.
  • Balance and loan amount are highly positively correlated.
  • Balance and installment are highly positively correlated.
  • Total number of currently open credit card accounts and number of currently open lines of credit are highly positively correlated.
  • Interest rate is positively correlated with paid interest, term and negatively correlated with total debit limit, so when building our model to predict the interest rate, we can take these numeric variables into consideration.

Plot 3: Boxplots for Interest Rate Comparison

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
plot3_verified_income <- ggplot(loan, aes(x = verified_income, y = interest_rate)) +
  theme1 +
  xlab("Verified Income") +
  ylab("Interest Rate") +
  geom_boxplot()
loan$term_cat <- as.character(loan$term)
plot3_term <- ggplot(loan, aes(x = term_cat, y = interest_rate)) +
  theme1 +
  xlab("Term") +
  ylab("Interest Rate") +
  geom_boxplot()
plot3_grade <- ggplot(loan, aes(x = grade, y = interest_rate)) +
  theme1 +
  xlab("Grade") +
  ylab("Interest Rate") +
  geom_boxplot()
plot3_disbursement_method <- ggplot(loan, aes(x = disbursement_method, y = interest_rate)) +
  theme1 +
  xlab("Disbursement Method") +
  ylab("Interest Rate") +
  geom_boxplot()
grid.arrange(plot3_verified_income, plot3_term, plot3_grade, plot3_disbursement_method, nrow = 2)

The boxplots serve to provide us with some ideas on predicting the interest rate. From the boxplots we can see that Grade has a significant impact on interest rate. Whether the income is verified, term and disbursement method also have some influence on the interest rate, so when building our model to predict the interest rate, we can take these categorical variables into consideration.

Plot 4: Annual Income Distribution by Different Loan Purposes

ggplot(loan, aes(x = annual_income)) +
  theme1 +
  geom_histogram(binwidth = 10000, alpha = 1, position = "identity") +
  facet_wrap(~ loan_purpose, scales = "free") +
  coord_cartesian(xlim = c(0, 300000)) +
  xlab("Annual Income") +
  ylab("Count") +
  ggtitle("Annual Income Distribution by Different Loan Purposes") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))

This visualization plots the distribution of annual income for each loan purpose. Before visualizing, I was wondering if there are some differences in annual income distribution for different loan purpose, but the result turns out to be that the annual income distributions are all right skewed for each loan purpose, and there is not any evident difference in mean or median.

However, we do observe that loan purpose of debt consolidation and credit card has the most counts, whereas the loan purpose of renewable energy and moving has the fewest counts.

Plot 5: Number of Delinquencies over the Last 2 Years by Homeownerships

loan$delinq_cat <- cut(loan$delinq_2y,
                       breaks = c(-Inf, 0, 1, max(loan$delinq_2y)),
                       labels = c('0', '1', '>1'))

plot5_stack <- ggplot(loan, aes(x = homeownership)) + 
  theme1 +
  geom_bar(aes(fill = delinq_cat), position = "stack") +
  scale_fill_manual(values = cbPalette, name = "Number of Delinquencies (Last 2 Years)") +
  xlab("Homeownership") +
  ylab("Proportion of Delinquencies") +
  ggtitle("Number of Delinquencies by Homeownerships (Stack Method)") +
  theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))

plot5_fill <- ggplot(loan, aes(x = homeownership)) +
  theme1 +
  geom_bar(aes(fill = delinq_cat), position = "fill") +
  scale_fill_manual(values = cbPalette, name = "Number of Delinquencies (Last 2 Years)") +
  xlab("Homeownership") +
  ylab("Proportion of Delinquencies") +
  ggtitle("Number of Delinquencies by Homeownerships (Fill Method)") +
  theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))

grid.arrange(plot5_stack, plot5_fill, nrow = 2)

I was wondering if Homeownership has some influence on Number of Delinquencies (Last 2 Years). From the stack method, we see that mortgage has the most counts and own has the fewest counts, and most applicants do not have any delinquencies over the last town years, but the stack method does not provide me with an obvious answer to my hypothesis. In this case, I replotted the bar chart using the fill method. From the lower plot we see that the homeownership of mortgage has slightly more delinquencies over the last 2 years, and rent has fewer delinquencies. This is an interesting finding and worth further study.

Predictive Model

Algorithm 1: Multiple Linear Regression

The first algorithm we use to predict the interest rate is multiple linear regression. We split the data into 80% training and 20% testing set. We first fit the model using variables we previously identified in correlation heatmap and boxplots, including paid_interest, total_debit_limit, verified_income, grade, term_cat, and disbursement_method. We also added some other variables (inquiries_last_12m, debt_to_income, accounts_opened_24m, and total_credit_limit) some might be relevant in predicting interest rate in our initial linear regression model. If they are not relevant in predicting, we will further remove it based on p-values or BIC.

## Split the loan data into 80% training data and 20% testing data
set.seed(1)
row.number <- sample(1:nrow(loan), 0.8*nrow(loan))
train = loan[row.number, ]
test = loan[-row.number, ]

## Fit our linear regression model
model <- lm(interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m + total_credit_limit + total_debit_limit + verified_income + grade + term_cat + disbursement_method + paid_interest, data = train)
summary(model)
## 
## Call:
## lm(formula = interest_rate ~ inquiries_last_12m + debt_to_income + 
##     accounts_opened_24m + total_credit_limit + total_debit_limit + 
##     verified_income + grade + term_cat + disbursement_method + 
##     paid_interest, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1367  -0.7791  -0.0997   0.7995   3.3963 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.744e+00  3.766e-02 179.086  < 2e-16 ***
## inquiries_last_12m              1.992e-02  5.708e-03   3.489 0.000487 ***
## debt_to_income                  4.068e-03  8.051e-04   5.053 4.44e-07 ***
## accounts_opened_24m             2.090e-02  4.323e-03   4.836 1.35e-06 ***
## total_credit_limit             -3.092e-07  7.200e-08  -4.295 1.77e-05 ***
## total_debit_limit              -5.251e-06  5.213e-07 -10.073  < 2e-16 ***
## verified_incomeSource Verified  8.762e-02  2.818e-02   3.109 0.001882 ** 
## verified_incomeVerified         1.479e-01  3.378e-02   4.378 1.21e-05 ***
## gradeB                          3.562e+00  3.486e-02 102.169  < 2e-16 ***
## gradeC                          7.088e+00  3.916e-02 181.007  < 2e-16 ***
## gradeD                          1.185e+01  4.828e-02 245.435  < 2e-16 ***
## gradeE                          1.770e+01  7.981e-02 221.765  < 2e-16 ***
## gradeF                          2.183e+01  1.656e-01 131.799  < 2e-16 ***
## gradeG                          2.329e+01  4.107e-01  56.716  < 2e-16 ***
## term_cat60                      9.004e-02  3.063e-02   2.940 0.003293 ** 
## disbursement_methodDirectPay   -3.203e-01  4.775e-02  -6.707 2.12e-11 ***
## paid_interest                   3.004e-04  3.038e-05   9.889  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.076 on 7965 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.953,  Adjusted R-squared:  0.9529 
## F-statistic: 1.01e+04 on 16 and 7965 DF,  p-value: < 2.2e-16

The summary table shows that all variables included in our initial model have a p-value significantly smaller than 0.05, which means that these variables are proved to be statistically significant in predicting interest rate. The R-squared and adjusted R-squared are both around 0.953, which means that 95.3% of the variance for interest rate is explained by variables of our choice in our linear regression model. The result 0.953 is very high, suggesting that our model is good.

Next, we applied BIC method for variable selection above to see if there are some variables that need to be removed from our model.

## BIC
step(model, k = log(nrow(loan)))
## Start:  AIC=1301.68
## interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m + 
##     total_credit_limit + total_debit_limit + verified_income + 
##     grade + term_cat + disbursement_method + paid_interest
## 
##                       Df Sum of Sq    RSS     AIC
## - term_cat             1        10   9223  1301.1
## <none>                               9213  1301.7
## - verified_income      2        24   9237  1303.8
## - inquiries_last_12m   1        14   9227  1304.7
## - total_credit_limit   1        21   9235  1310.9
## - accounts_opened_24m  1        27   9240  1315.9
## - debt_to_income       1        30   9243  1318.0
## - disbursement_method  1        52   9265  1337.4
## - paid_interest        1       113   9326  1389.9
## - total_debit_limit    1       117   9331  1393.5
## - grade                6    102375 111588 21154.8
## 
## Step:  AIC=1301.12
## interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m + 
##     total_credit_limit + total_debit_limit + verified_income + 
##     grade + disbursement_method + paid_interest
## 
##                       Df Sum of Sq    RSS     AIC
## <none>                               9223  1301.1
## - verified_income      2        23   9247  1303.0
## - inquiries_last_12m   1        13   9237  1303.5
## - total_credit_limit   1        19   9242  1308.5
## - accounts_opened_24m  1        26   9250  1314.6
## - debt_to_income       1        28   9251  1316.1
## - disbursement_method  1        49   9273  1334.5
## - total_debit_limit    1       114   9337  1389.8
## - paid_interest        1       153   9376  1423.2
## - grade                6    106288 115511 21421.4
## 
## Call:
## lm(formula = interest_rate ~ inquiries_last_12m + debt_to_income + 
##     accounts_opened_24m + total_credit_limit + total_debit_limit + 
##     verified_income + grade + disbursement_method + paid_interest, 
##     data = train)
## 
## Coefficients:
##                    (Intercept)              inquiries_last_12m  
##                      6.736e+00                       1.946e-02  
##                 debt_to_income             accounts_opened_24m  
##                      3.952e-03                       2.057e-02  
##             total_credit_limit               total_debit_limit  
##                     -2.923e-07                      -5.162e-06  
## verified_incomeSource Verified         verified_incomeVerified  
##                      8.741e-02                       1.469e-01  
##                         gradeB                          gradeC  
##                      3.579e+00                       7.112e+00  
##                         gradeD                          gradeE  
##                      1.187e+01                       1.773e+01  
##                         gradeF                          gradeG  
##                      2.187e+01                       2.332e+01  
##   disbursement_methodDirectPay                   paid_interest  
##                     -3.112e-01                       3.298e-04

The BIC method removes the TERM from our model, but based on the result from previous boxplots, we do observe that term has some impacts on the interest, so we decide to keep it in our model.

Model 1 Result Visualizations

Some very important assumption on a linear regression model are:

  • Linearity
  • Homoscedasticity, which means that the error is constant along the values of the dependent variables.
  • No auto-correlation
  • Normality

We first check model assumptions by plotting residuals vs fitted, normal Q-Q plot, scale-location, and residuals vs leverage. We see that apart from one outlier, the residuals are normal and constant, and there is no autocorrelation. Our model does not violate any model assumptions.

## Checking model assumptions
par(mfrow = c(2, 2))
plot(model)

We visualized the predicted result below. As we can see, the predicted results align with the real data well, suggesting that our model is a good fit in predicting the interest rate.

## Visualize the predicted result
pred <- predict(model, newdata = test, interval = "confidence")
pred_fit <- pred[,1]
plot(test$interest_rate, pred_fit, xlab = "Test Data Interest Rate", ylab = "Predicted Interest Rate")

If I have more time, I will try adding interaction terms into our linear regression model to further see if there are any interaction terms that are statistically significant or help to lower adjusted R-squared.

Algorithm 2: Generalized least square with interactions.

This time, we use the generalized least square method, and an interaction term total_credit_limit*total_debit_limit to predict the interest rate.

model2 <- glm(interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m + total_credit_limit + total_debit_limit + verified_income + grade + term_cat + disbursement_method + paid_interest + total_credit_limit*total_debit_limit, data = train)
summary(model2)
## 
## Call:
## glm(formula = interest_rate ~ inquiries_last_12m + debt_to_income + 
##     accounts_opened_24m + total_credit_limit + total_debit_limit + 
##     verified_income + grade + term_cat + disbursement_method + 
##     paid_interest + total_credit_limit * total_debit_limit, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.1661   -0.7718   -0.1044    0.7964    3.2091  
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           6.827e+00  4.055e-02 168.345  < 2e-16 ***
## inquiries_last_12m                    1.952e-02  5.698e-03   3.425 0.000617 ***
## debt_to_income                        4.422e-03  8.063e-04   5.485 4.26e-08 ***
## accounts_opened_24m                   2.291e-02  4.331e-03   5.291 1.25e-07 ***
## total_credit_limit                   -6.640e-07  9.717e-08  -6.833 8.90e-12 ***
## total_debit_limit                    -8.292e-06  7.648e-07 -10.842  < 2e-16 ***
## verified_incomeSource Verified        8.205e-02  2.815e-02   2.915 0.003568 ** 
## verified_incomeVerified               1.371e-01  3.378e-02   4.058 5.00e-05 ***
## gradeB                                3.544e+00  3.494e-02 101.436  < 2e-16 ***
## gradeC                                7.061e+00  3.940e-02 179.205  < 2e-16 ***
## gradeD                                1.181e+01  4.866e-02 242.767  < 2e-16 ***
## gradeE                                1.766e+01  8.005e-02 220.570  < 2e-16 ***
## gradeF                                2.179e+01  1.655e-01 131.603  < 2e-16 ***
## gradeG                                2.327e+01  4.100e-01  56.759  < 2e-16 ***
## term_cat60                            1.033e-01  3.067e-02   3.367 0.000763 ***
## disbursement_methodDirectPay         -3.188e-01  4.766e-02  -6.688 2.41e-11 ***
## paid_interest                         3.146e-04  3.043e-05  10.336  < 2e-16 ***
## total_credit_limit:total_debit_limit  8.820e-12  1.626e-12   5.425 5.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.15261)
## 
##     Null deviance: 196211.5  on 7981  degrees of freedom
## Residual deviance:   9179.4  on 7964  degrees of freedom
##   (18 observations deleted due to missingness)
## AIC: 23806
## 
## Number of Fisher Scoring iterations: 2

Model 2 Result Visualizations

Again, the GLM model has the same model assumption as the LM model.

Simiarly, we first check model assumptions by plotting residuals vs fitted, normal Q-Q plot, scale-location, and residuals vs leverage. We see that apart from one outlier, the residuals are normal and constant, and there is no autocorrelation. Our model does not violate any model assumptions.

## Checking model assumptions
par(mfrow = c(2, 2))
plot(model2)

We visualized the predicted result below. As we can see, the predicted results align with the real test data well, suggesting that our GLM model with interaction terms is a good fit in predicting the interest rate.

## Visualize the predicted result
pred2 <- predict(model2, newdata = test, interval = "confidence")
plot(test$interest_rate, pred2, xlab = "Test Data Interest Rate", ylab = "Predicted Interest Rate")

Case Study 2

For this case study, I used the sqldf package in r and wrote SQL queries.

df <- read.csv("casestudy.csv")
head(df)
##   X        customer_email net_revenue year
## 1 0  nhknapwsbx@gmail.com      249.92 2015
## 2 1  joiuzbvcpn@gmail.com       87.61 2015
## 3 2  ukkjctepxt@gmail.com      168.38 2015
## 4 3  gykatilzrt@gmail.com       62.40 2015
## 5 4  mmsgsrtxah@gmail.com       43.08 2015
## 6 5  mobvusnzfr@gmail.com       39.44 2015

Questions:

Total revenue for the current year

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 0x0006): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Versions/4.2/Resources/modules/R_X11.so
##   Reason: tried: '/opt/X11/lib/libSM.6.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libSM.6.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libSM.6.dylib' (no such file)
## tcltk DLL is linked to '/opt/X11/lib/libX11.6.dylib'
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite
total_revenue <- sqldf('select distinct year as Year, sum(net_revenue) over (partition by year) as "Total_Revenue" from df
                       order by year desc')
total_revenue
##   Year Total_Revenue
## 1 2017      31417495
## 2 2016      25730944
## 3 2015      29036749

New Customer Revenue

sqldf('select year as Year, sum(net_revenue) as "New Customers Total Revenue" from df 
      where year = 2017 and customer_email not in (select customer_email from df where year = 2016)
      union all
      select year as Year, sum(net_revenue) as "New Customers Total Revenue" from df
      where year = 2016 and customer_email not in (select customer_email from df where year = 2015)')
##   Year New Customers Total Revenue
## 1 2017                    28776235
## 2 2016                    18245491

Existing Customer Growth

Since we don’t have the data for 2014, we don’t know the changes in customers from 2014 to 2015 and hence don’t know the existing customers of 2015. In this case, we only calculate the revenue for existing customers in 2017 and 2016 and then subtract the revenue of existing customers in 2016 from the revenue of existing customers in 2017 to obtain existing customer growth, which is -4,844,193.

existing_cus_rev <- sqldf('select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
                          where year = 2017 and customer_email in (select customer_email from df where year = 2016)
                          union all
                          select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
                          where year = 2016 and customer_email in (select customer_email from df where year = 2015)')
existing_cus_rev
##   Year Existing Customers Total Revenue
## 1 2017                          2641260
## 2 2016                          7485453
existing_cus_growth <- 2641260 - 7485453
existing_cus_growth
## [1] -4844193

Revenue Lost from Attrition

## Revenue of Customers who are presented in 2016 but not in 2017
rev_loss_attrition_2017 <- sqldf('select sum(net_revenue) as "Revenue Lost from Attrition from 2016 to 2017" from df
                        where year = 2016 and customer_email not in (select customer_email from df where year = 2017)')
rev_loss_attrition_2017
##   Revenue Lost from Attrition from 2016 to 2017
## 1                                      23110295
## Revenue of Customers who are presented in 2015 but not in 2016
rev_loss_attrition_2016 <- sqldf('select sum(net_revenue) as "Revenue Lost from Attrition from 2015 to 2016" from df
                        where year = 2015 and customer_email not in (select customer_email from df where year = 2016)')
rev_loss_attrition_2016
##   Revenue Lost from Attrition from 2015 to 2016
## 1                                      21571632

Existing Customer Revenue Current Year

existing_cus_rev <- sqldf('select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
                          where year = 2017 and customer_email in (select customer_email from df where year = 2016)
                          union all
                          select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
                          where year = 2016 and customer_email in (select customer_email from df where year = 2015)')
existing_cus_rev
##   Year Existing Customers Total Revenue
## 1 2017                          2641260
## 2 2016                          7485453

Existing Customer Revenue Prior Year

sqldf('select Year, lead("Existing Customers Total Revenue", 1) over(order by Year desc) as "Existing Customer Revenue Prior Year" from existing_cus_rev')
##   Year Existing Customer Revenue Prior Year
## 1 2017                              7485453
## 2 2016                                   NA

Total Customers Current Year

total_cus <- sqldf('select distinct year as Year, count(customer_email) over(partition by year) as "Total_Customers_Current_Year" from df
                    order by year desc')
total_cus
##   Year Total_Customers_Current_Year
## 1 2017                       249987
## 2 2016                       204646
## 3 2015                       231294

Total Customers Previous Year

The “Total Customers Previous Year” of 2017 is the total customers of 2016.

The “Total Customers Previous Year” of 2016 is the total customers of 2015.

So I use the lead window function in SQL.

sqldf('with cte as (select distinct year as Year, count(customer_email) over(partition by year) as "Total Customers Current Year" from df
      order by year desc)
      select Year, lead("Total Customers Current Year", 1) over(order by Year desc) as "Total Customers Previous Year" from cte')
##   Year Total Customers Previous Year
## 1 2017                        204646
## 2 2016                        231294
## 3 2015                            NA

New Customers

For this question to save space, I only show 5 new customers each year using “limit 5”.

## New Customers of 2017
sqldf('select customer_email as "New Customers of 2017" from df 
      where year = 2017 and customer_email not in (select customer_email from df where year = 2016)
      limit 5')
##   New Customers of 2017
## 1  mwrossuukz@gmail.com
## 2 qvjfterwnk@gmail.com 
## 3  vlyigtgfzs@gmail.com
## 4  yfirychuhk@gmail.com
## 5  fdkiioqtli@gmail.com
## New Customers of 2016
sqldf('select customer_email as "New Customers of 2016" from df 
      where year = 2016 and customer_email not in (select customer_email from df where year = 2015)
      limit 5')
##   New Customers of 2016
## 1  mwrossuukz@gmail.com
## 2  gkwsoupawk@gmail.com
## 3  vlyigtgfzs@gmail.com
## 4  yfirychuhk@gmail.com
## 5  trnzgqinuu@gmail.com

Lost Customers

For this question, to save space, I only show 5 lost customers each year using “limit 5”.

## Customers who are presented in 2016 but not in 2017 are Lost Customers from 2016 to 2017
sqldf('select customer_email from df
      where year = 2016 and customer_email not in (select customer_email from df where year = 2017)
      limit 5')
##          customer_email
## 1  mwrossuukz@gmail.com
## 2  gkwsoupawk@gmail.com
## 3  vlyigtgfzs@gmail.com
## 4  yfirychuhk@gmail.com
## 5  trnzgqinuu@gmail.com
## Customers who are presented in 2015 but not in 2016 are Lost Customers from 2015 to 2016
sqldf('select customer_email from df
      where year = 2015 and customer_email not in (select customer_email from df where year = 2016)
      limit 5')
##          customer_email
## 1  nhknapwsbx@gmail.com
## 2  joiuzbvcpn@gmail.com
## 3  ukkjctepxt@gmail.com
## 4  gykatilzrt@gmail.com
## 5  mmsgsrtxah@gmail.com

Visualizations

Plot 1: Total Revenue Comparison

ggplot(total_revenue, aes(x = Year, y = Total_Revenue)) +
  theme1 +
  geom_line() +
  ylim(20000000, 40000000) +
  xlab("Year") +
  ylab("Total Revenue") +
  ggtitle("Total Revenue Each Year") +
  theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))

Plot 2: Total Customer Comparison

df$year = as.character(df$year)
 ggplot(df, aes(x = year)) +
  theme1 +
  geom_bar() +
  xlab("Year") +
  ylab("Customer Count") +
  ggtitle("Number of Total Customers Each Year") +
  theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))