There are a total of 10,000 observations and 55 features, including both numeric and categorical variables, associated with the dataset. The summary function below gives a general description of each variable.
There are some NAs for some variables. For example, there are 817 NAs for emp_length (number of years in the job). These NAs appear when the emp_title is blank. This is reasonable since some applicants are not employed or have already retired. There are 8505 NAs for annual_income_joint and debt_to_income_joint, indicating that these people are not joint applicants. There are 5658 NAs for months_since_last_delinq and 7715 NAs for months_since_90d_late. Since the number of NAs exceeds 50% of the total observations, we do not consider these two columns in further study.
In addition, columns current_accounts_delinq, num_accounts_120d_past_due, and num_accounts_30d_past_due have nearly all values equaling 0, so we do not consider these variables in further study.
## Read the Data
loan <- read.csv("loans_full_schema.csv")
head(loan)
## emp_title emp_length state homeownership annual_income
## 1 global config engineer 3 NJ MORTGAGE 90000
## 2 warehouse office clerk 10 HI RENT 40000
## 3 assembly 3 WI RENT 40000
## 4 customer service 1 PA RENT 30000
## 5 security supervisor 10 CA RENT 35000
## 6 NA KY OWN 34000
## verified_income debt_to_income annual_income_joint verification_income_joint
## 1 Verified 18.01 NA
## 2 Not Verified 5.04 NA
## 3 Source Verified 21.15 NA
## 4 Not Verified 10.16 NA
## 5 Verified 57.96 57000 Verified
## 6 Not Verified 6.46 NA
## debt_to_income_joint delinq_2y months_since_last_delinq earliest_credit_line
## 1 NA 0 38 2001
## 2 NA 0 NA 1996
## 3 NA 0 28 2006
## 4 NA 0 NA 2007
## 5 37.66 0 NA 2008
## 6 NA 1 3 1990
## inquiries_last_12m total_credit_lines open_credit_lines total_credit_limit
## 1 6 28 10 70795
## 2 1 30 14 28800
## 3 4 31 10 24193
## 4 0 4 4 25400
## 5 7 22 16 69839
## 6 6 32 12 42100
## total_credit_utilized num_collections_last_12m num_historical_failed_to_pay
## 1 38767 0 0
## 2 4321 0 1
## 3 16000 0 0
## 4 4997 0 1
## 5 52722 0 0
## 6 3898 0 0
## months_since_90d_late current_accounts_delinq total_collection_amount_ever
## 1 38 0 1250
## 2 NA 0 0
## 3 28 0 432
## 4 NA 0 0
## 5 NA 0 0
## 6 60 0 0
## current_installment_accounts accounts_opened_24m
## 1 2 5
## 2 0 11
## 3 1 13
## 4 1 1
## 5 1 6
## 6 0 2
## months_since_last_credit_inquiry num_satisfactory_accounts
## 1 5 10
## 2 8 14
## 3 7 10
## 4 15 4
## 5 4 16
## 6 5 12
## num_accounts_120d_past_due num_accounts_30d_past_due
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## num_active_debit_accounts total_debit_limit num_total_cc_accounts
## 1 2 11100 14
## 2 3 16500 24
## 3 3 4300 14
## 4 2 19400 3
## 5 10 32700 20
## 6 1 27200 27
## num_open_cc_accounts num_cc_carrying_balance num_mort_accounts
## 1 8 6 1
## 2 14 4 0
## 3 8 6 0
## 4 3 2 0
## 5 15 13 0
## 6 12 5 3
## account_never_delinq_percent tax_liens public_record_bankrupt
## 1 92.9 0 0
## 2 100.0 0 1
## 3 93.5 0 0
## 4 100.0 1 0
## 5 100.0 0 0
## 6 78.1 0 0
## loan_purpose application_type loan_amount term interest_rate
## 1 moving individual 28000 60 14.07
## 2 debt_consolidation individual 5000 36 12.61
## 3 other individual 2000 36 17.09
## 4 debt_consolidation individual 21600 36 6.72
## 5 credit_card joint 23000 36 14.07
## 6 other individual 5000 36 6.72
## installment grade sub_grade issue_month loan_status initial_listing_status
## 1 652.53 C C3 Mar-2018 Current whole
## 2 167.54 C C1 Feb-2018 Current whole
## 3 71.40 D D1 Feb-2018 Current fractional
## 4 664.19 A A3 Jan-2018 Current whole
## 5 786.87 C C3 Mar-2018 Current whole
## 6 153.75 A A3 Jan-2018 Current whole
## disbursement_method balance paid_total paid_principal paid_interest
## 1 Cash 27015.86 1999.33 984.14 1015.19
## 2 Cash 4651.37 499.12 348.63 150.49
## 3 Cash 1824.63 281.80 175.37 106.43
## 4 Cash 18853.26 3312.89 2746.74 566.15
## 5 Cash 21430.15 2324.65 1569.85 754.80
## 6 Cash 4256.71 873.13 743.29 129.84
## paid_late_fees
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
dim(loan)
## [1] 10000 55
summary(loan)
## emp_title emp_length state homeownership
## Length:10000 Min. : 0.00 Length:10000 Length:10000
## Class :character 1st Qu.: 2.00 Class :character Class :character
## Mode :character Median : 6.00 Mode :character Mode :character
## Mean : 5.93
## 3rd Qu.:10.00
## Max. :10.00
## NA's :817
## annual_income verified_income debt_to_income annual_income_joint
## Min. : 0 Length:10000 Min. : 0.00 Min. : 19200
## 1st Qu.: 45000 Class :character 1st Qu.: 11.06 1st Qu.: 86834
## Median : 65000 Mode :character Median : 17.57 Median : 113000
## Mean : 79222 Mean : 19.31 Mean : 127915
## 3rd Qu.: 95000 3rd Qu.: 25.00 3rd Qu.: 151546
## Max. :2300000 Max. :469.09 Max. :1100000
## NA's :24 NA's :8505
## verification_income_joint debt_to_income_joint delinq_2y
## Length:10000 Min. : 0.32 Min. : 0.000
## Class :character 1st Qu.:14.16 1st Qu.: 0.000
## Mode :character Median :19.72 Median : 0.000
## Mean :19.98 Mean : 0.216
## 3rd Qu.:25.50 3rd Qu.: 0.000
## Max. :39.98 Max. :13.000
## NA's :8505
## months_since_last_delinq earliest_credit_line inquiries_last_12m
## Min. : 1.00 Min. :1963 Min. : 0.000
## 1st Qu.: 19.00 1st Qu.:1997 1st Qu.: 0.000
## Median : 34.00 Median :2003 Median : 1.000
## Mean : 36.76 Mean :2001 Mean : 1.958
## 3rd Qu.: 53.00 3rd Qu.:2006 3rd Qu.: 3.000
## Max. :118.00 Max. :2015 Max. :29.000
## NA's :5658
## total_credit_lines open_credit_lines total_credit_limit total_credit_utilized
## Min. : 2.00 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.:14.00 1st Qu.: 7.0 1st Qu.: 51594 1st Qu.: 19186
## Median :21.00 Median :10.0 Median : 114667 Median : 36927
## Mean :22.68 Mean :11.4 Mean : 183606 Mean : 51049
## 3rd Qu.:29.00 3rd Qu.:14.0 3rd Qu.: 267550 3rd Qu.: 65421
## Max. :87.00 Max. :51.0 Max. :3386034 Max. :942456
##
## num_collections_last_12m num_historical_failed_to_pay months_since_90d_late
## Min. :0.0000 Min. : 0.0000 Min. : 2.00
## 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.: 29.00
## Median :0.0000 Median : 0.0000 Median : 47.00
## Mean :0.0138 Mean : 0.1671 Mean : 46.11
## 3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.: 63.00
## Max. :3.0000 Max. :52.0000 Max. :128.00
## NA's :7715
## current_accounts_delinq total_collection_amount_ever
## Min. :0e+00 Min. : 0.0
## 1st Qu.:0e+00 1st Qu.: 0.0
## Median :0e+00 Median : 0.0
## Mean :1e-04 Mean : 184.3
## 3rd Qu.:0e+00 3rd Qu.: 0.0
## Max. :1e+00 Max. :199308.0
##
## current_installment_accounts accounts_opened_24m
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 2.000 Median : 4.000
## Mean : 2.664 Mean : 4.376
## 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :35.000 Max. :29.000
##
## months_since_last_credit_inquiry num_satisfactory_accounts
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 7.00
## Median : 6.000 Median :10.00
## Mean : 7.341 Mean :11.38
## 3rd Qu.:11.000 3rd Qu.:14.00
## Max. :24.000 Max. :51.00
## NA's :1271
## num_accounts_120d_past_due num_accounts_30d_past_due num_active_debit_accounts
## Min. :0 Min. :0e+00 Min. : 0.000
## 1st Qu.:0 1st Qu.:0e+00 1st Qu.: 2.000
## Median :0 Median :0e+00 Median : 3.000
## Mean :0 Mean :1e-04 Mean : 3.595
## 3rd Qu.:0 3rd Qu.:0e+00 3rd Qu.: 5.000
## Max. :0 Max. :1e+00 Max. :32.000
## NA's :318
## total_debit_limit num_total_cc_accounts num_open_cc_accounts
## Min. : 0 Min. : 2.00 Min. : 0.000
## 1st Qu.: 10000 1st Qu.: 7.00 1st Qu.: 5.000
## Median : 19500 Median :11.00 Median : 7.000
## Mean : 27357 Mean :13.03 Mean : 8.095
## 3rd Qu.: 36100 3rd Qu.:17.00 3rd Qu.:10.000
## Max. :386700 Max. :66.00 Max. :46.000
##
## num_cc_carrying_balance num_mort_accounts account_never_delinq_percent
## Min. : 0.000 Min. : 0.000 Min. : 14.30
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 92.60
## Median : 5.000 Median : 1.000 Median :100.00
## Mean : 5.231 Mean : 1.383 Mean : 94.65
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.:100.00
## Max. :43.000 Max. :14.000 Max. :100.00
##
## tax_liens public_record_bankrupt loan_purpose application_type
## Min. : 0.0000 Min. :0.0000 Length:10000 Length:10000
## 1st Qu.: 0.0000 1st Qu.:0.0000 Class :character Class :character
## Median : 0.0000 Median :0.0000 Mode :character Mode :character
## Mean : 0.0433 Mean :0.1238
## 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :52.0000 Max. :3.0000
##
## loan_amount term interest_rate installment
## Min. : 1000 Min. :36.00 Min. : 5.31 Min. : 30.75
## 1st Qu.: 8000 1st Qu.:36.00 1st Qu.: 9.43 1st Qu.: 256.04
## Median :14500 Median :36.00 Median :11.98 Median : 398.42
## Mean :16362 Mean :43.27 Mean :12.43 Mean : 476.21
## 3rd Qu.:24000 3rd Qu.:60.00 3rd Qu.:15.05 3rd Qu.: 644.69
## Max. :40000 Max. :60.00 Max. :30.94 Max. :1566.59
##
## grade sub_grade issue_month loan_status
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## initial_listing_status disbursement_method balance paid_total
## Length:10000 Length:10000 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 6679 1st Qu.: 928.7
## Mode :character Mode :character Median :12380 Median : 1563.3
## Mean :14459 Mean : 2494.2
## 3rd Qu.:20690 3rd Qu.: 2616.0
## Max. :40000 Max. :41630.4
##
## paid_principal paid_interest paid_late_fees
## Min. : 0.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 587.1 1st Qu.: 221.8 1st Qu.: 0.0000
## Median : 985.0 Median : 446.1 Median : 0.0000
## Mean : 1894.5 Mean : 599.7 Mean : 0.1195
## 3rd Qu.: 1694.6 3rd Qu.: 825.4 3rd Qu.: 0.0000
## Max. :40000.0 Max. :4216.4 Max. :52.9800
##
## number of NAs in each column
colSums(is.na(loan))
## emp_title emp_length
## 0 817
## state homeownership
## 0 0
## annual_income verified_income
## 0 0
## debt_to_income annual_income_joint
## 24 8505
## verification_income_joint debt_to_income_joint
## 0 8505
## delinq_2y months_since_last_delinq
## 0 5658
## earliest_credit_line inquiries_last_12m
## 0 0
## total_credit_lines open_credit_lines
## 0 0
## total_credit_limit total_credit_utilized
## 0 0
## num_collections_last_12m num_historical_failed_to_pay
## 0 0
## months_since_90d_late current_accounts_delinq
## 7715 0
## total_collection_amount_ever current_installment_accounts
## 0 0
## accounts_opened_24m months_since_last_credit_inquiry
## 0 1271
## num_satisfactory_accounts num_accounts_120d_past_due
## 0 318
## num_accounts_30d_past_due num_active_debit_accounts
## 0 0
## total_debit_limit num_total_cc_accounts
## 0 0
## num_open_cc_accounts num_cc_carrying_balance
## 0 0
## num_mort_accounts account_never_delinq_percent
## 0 0
## tax_liens public_record_bankrupt
## 0 0
## loan_purpose application_type
## 0 0
## loan_amount term
## 0 0
## interest_rate installment
## 0 0
## grade sub_grade
## 0 0
## issue_month loan_status
## 0 0
## initial_listing_status disbursement_method
## 0 0
## balance paid_total
## 0 0
## paid_principal paid_interest
## 0 0
## paid_late_fees
## 0
## Check number of zeros of each column
colSums(loan == 0)
## emp_title emp_length
## 0 NA
## state homeownership
## 0 0
## annual_income verified_income
## 23 0
## debt_to_income annual_income_joint
## NA NA
## verification_income_joint debt_to_income_joint
## 0 NA
## delinq_2y months_since_last_delinq
## 8576 NA
## earliest_credit_line inquiries_last_12m
## 0 3060
## total_credit_lines open_credit_lines
## 0 2
## total_credit_limit total_credit_utilized
## 2 28
## num_collections_last_12m num_historical_failed_to_pay
## 9873 8586
## months_since_90d_late current_accounts_delinq
## NA 9999
## total_collection_amount_ever current_installment_accounts
## 8635 1414
## accounts_opened_24m months_since_last_credit_inquiry
## 527 NA
## num_satisfactory_accounts num_accounts_120d_past_due
## 2 NA
## num_accounts_30d_past_due num_active_debit_accounts
## 9999 333
## total_debit_limit num_total_cc_accounts
## 135 0
## num_open_cc_accounts num_cc_carrying_balance
## 7 125
## num_mort_accounts account_never_delinq_percent
## 4484 0
## tax_liens public_record_bankrupt
## 9754 8785
## loan_purpose application_type
## 0 0
## loan_amount term
## 0 0
## interest_rate installment
## 0 0
## grade sub_grade
## 0 0
## issue_month loan_status
## 0 0
## initial_listing_status disbursement_method
## 0 0
## balance paid_total
## 455 15
## paid_principal paid_interest
## 15 18
## paid_late_fees
## 9948
I generated 5 plots: * The Most 20 Job Titles Applying for Loans * Correlation Heatmap * Boxplots for Interest Rate Comparison * Annual Income Distribution by Different Loan Purposes * Number of Delinquencies over the Last 2 Years by Homeownerships
Before performing visualizations, I created my own theme called theme1.
library(ggplot2)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
theme1 <- theme_bw() +
theme(axis.text = element_text(size = 8, colour = "#6b3447"),
axis.title = element_text(size = 10, colour = "#2f2f63"),
legend.title = element_text(size = 8, colour = "#2f2f63"),
legend.text = element_text(size = 8, colour = "#6b3447"),
title = element_text(size = 9, colour = "#2f2f63"),
axis.ticks = element_line(colour = "#6b3447"),
plot.caption = element_text(size = 8, colour = "#2f2f63"),
plot.subtitle = element_text(size = 10, colour = "#2f2f63"))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_5 <- loan %>%
filter(!any(is.na(emp_title))) %>%
group_by(emp_title) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
top_n(21)
## Selecting by count
library(forcats)
ggplot(df_5[-1, ], aes(x = count, y = fct_reorder(emp_title, count))) +
theme1 +
geom_col(col = "pink") +
xlab("Count") +
ylab("Job Title") +
ggtitle("The Most 20 Job Titles Applying for Loans") +
theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))
From the plot we can see that the most 20 job titles applying for loans
are MANAGER, OWNER, TEACHER, DRIVER, SALES, etc.
library(ggcorrplot)
cor_data <- loan[, c('annual_income', 'earliest_credit_line', 'inquiries_last_12m', 'total_credit_lines', 'open_credit_lines', 'total_credit_limit', 'total_credit_utilized', 'num_historical_failed_to_pay', 'accounts_opened_24m', 'current_installment_accounts', 'num_active_debit_accounts', 'total_debit_limit', 'num_total_cc_accounts', 'num_open_cc_accounts', 'public_record_bankrupt', 'loan_amount', 'term', 'installment', 'balance', 'paid_total', 'paid_interest', 'interest_rate')]
ggcorrplot(cor(cor_data), pch = 1, pch.cex =1, tl.cex = 8) +
theme1 +
xlab("Variable List 1") +
ylab("Variable List 2") +
ggtitle("Correlation Heatmap") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The correlation heatmap allows us the see how numeric variables are correlated and also gives us some idea on building models to predict interest rate. From the heatmap, we get the following results:
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot3_verified_income <- ggplot(loan, aes(x = verified_income, y = interest_rate)) +
theme1 +
xlab("Verified Income") +
ylab("Interest Rate") +
geom_boxplot()
loan$term_cat <- as.character(loan$term)
plot3_term <- ggplot(loan, aes(x = term_cat, y = interest_rate)) +
theme1 +
xlab("Term") +
ylab("Interest Rate") +
geom_boxplot()
plot3_grade <- ggplot(loan, aes(x = grade, y = interest_rate)) +
theme1 +
xlab("Grade") +
ylab("Interest Rate") +
geom_boxplot()
plot3_disbursement_method <- ggplot(loan, aes(x = disbursement_method, y = interest_rate)) +
theme1 +
xlab("Disbursement Method") +
ylab("Interest Rate") +
geom_boxplot()
grid.arrange(plot3_verified_income, plot3_term, plot3_grade, plot3_disbursement_method, nrow = 2)
The boxplots serve to provide us with some ideas on predicting the interest rate. From the boxplots we can see that Grade has a significant impact on interest rate. Whether the income is verified, term and disbursement method also have some influence on the interest rate, so when building our model to predict the interest rate, we can take these categorical variables into consideration.
ggplot(loan, aes(x = annual_income)) +
theme1 +
geom_histogram(binwidth = 10000, alpha = 1, position = "identity") +
facet_wrap(~ loan_purpose, scales = "free") +
coord_cartesian(xlim = c(0, 300000)) +
xlab("Annual Income") +
ylab("Count") +
ggtitle("Annual Income Distribution by Different Loan Purposes") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
This visualization plots the distribution of annual income for each loan purpose. Before visualizing, I was wondering if there are some differences in annual income distribution for different loan purpose, but the result turns out to be that the annual income distributions are all right skewed for each loan purpose, and there is not any evident difference in mean or median.
However, we do observe that loan purpose of debt consolidation and credit card has the most counts, whereas the loan purpose of renewable energy and moving has the fewest counts.
loan$delinq_cat <- cut(loan$delinq_2y,
breaks = c(-Inf, 0, 1, max(loan$delinq_2y)),
labels = c('0', '1', '>1'))
plot5_stack <- ggplot(loan, aes(x = homeownership)) +
theme1 +
geom_bar(aes(fill = delinq_cat), position = "stack") +
scale_fill_manual(values = cbPalette, name = "Number of Delinquencies (Last 2 Years)") +
xlab("Homeownership") +
ylab("Proportion of Delinquencies") +
ggtitle("Number of Delinquencies by Homeownerships (Stack Method)") +
theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))
plot5_fill <- ggplot(loan, aes(x = homeownership)) +
theme1 +
geom_bar(aes(fill = delinq_cat), position = "fill") +
scale_fill_manual(values = cbPalette, name = "Number of Delinquencies (Last 2 Years)") +
xlab("Homeownership") +
ylab("Proportion of Delinquencies") +
ggtitle("Number of Delinquencies by Homeownerships (Fill Method)") +
theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))
grid.arrange(plot5_stack, plot5_fill, nrow = 2)
I was wondering if Homeownership has some influence on Number of Delinquencies (Last 2 Years). From the stack method, we see that mortgage has the most counts and own has the fewest counts, and most applicants do not have any delinquencies over the last town years, but the stack method does not provide me with an obvious answer to my hypothesis. In this case, I replotted the bar chart using the fill method. From the lower plot we see that the homeownership of mortgage has slightly more delinquencies over the last 2 years, and rent has fewer delinquencies. This is an interesting finding and worth further study.
The first algorithm we use to predict the interest rate is multiple linear regression. We split the data into 80% training and 20% testing set. We first fit the model using variables we previously identified in correlation heatmap and boxplots, including paid_interest, total_debit_limit, verified_income, grade, term_cat, and disbursement_method. We also added some other variables (inquiries_last_12m, debt_to_income, accounts_opened_24m, and total_credit_limit) some might be relevant in predicting interest rate in our initial linear regression model. If they are not relevant in predicting, we will further remove it based on p-values or BIC.
## Split the loan data into 80% training data and 20% testing data
set.seed(1)
row.number <- sample(1:nrow(loan), 0.8*nrow(loan))
train = loan[row.number, ]
test = loan[-row.number, ]
## Fit our linear regression model
model <- lm(interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m + total_credit_limit + total_debit_limit + verified_income + grade + term_cat + disbursement_method + paid_interest, data = train)
summary(model)
##
## Call:
## lm(formula = interest_rate ~ inquiries_last_12m + debt_to_income +
## accounts_opened_24m + total_credit_limit + total_debit_limit +
## verified_income + grade + term_cat + disbursement_method +
## paid_interest, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1367 -0.7791 -0.0997 0.7995 3.3963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.744e+00 3.766e-02 179.086 < 2e-16 ***
## inquiries_last_12m 1.992e-02 5.708e-03 3.489 0.000487 ***
## debt_to_income 4.068e-03 8.051e-04 5.053 4.44e-07 ***
## accounts_opened_24m 2.090e-02 4.323e-03 4.836 1.35e-06 ***
## total_credit_limit -3.092e-07 7.200e-08 -4.295 1.77e-05 ***
## total_debit_limit -5.251e-06 5.213e-07 -10.073 < 2e-16 ***
## verified_incomeSource Verified 8.762e-02 2.818e-02 3.109 0.001882 **
## verified_incomeVerified 1.479e-01 3.378e-02 4.378 1.21e-05 ***
## gradeB 3.562e+00 3.486e-02 102.169 < 2e-16 ***
## gradeC 7.088e+00 3.916e-02 181.007 < 2e-16 ***
## gradeD 1.185e+01 4.828e-02 245.435 < 2e-16 ***
## gradeE 1.770e+01 7.981e-02 221.765 < 2e-16 ***
## gradeF 2.183e+01 1.656e-01 131.799 < 2e-16 ***
## gradeG 2.329e+01 4.107e-01 56.716 < 2e-16 ***
## term_cat60 9.004e-02 3.063e-02 2.940 0.003293 **
## disbursement_methodDirectPay -3.203e-01 4.775e-02 -6.707 2.12e-11 ***
## paid_interest 3.004e-04 3.038e-05 9.889 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.076 on 7965 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.953, Adjusted R-squared: 0.9529
## F-statistic: 1.01e+04 on 16 and 7965 DF, p-value: < 2.2e-16
The summary table shows that all variables included in our initial model have a p-value significantly smaller than 0.05, which means that these variables are proved to be statistically significant in predicting interest rate. The R-squared and adjusted R-squared are both around 0.953, which means that 95.3% of the variance for interest rate is explained by variables of our choice in our linear regression model. The result 0.953 is very high, suggesting that our model is good.
Next, we applied BIC method for variable selection above to see if there are some variables that need to be removed from our model.
## BIC
step(model, k = log(nrow(loan)))
## Start: AIC=1301.68
## interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m +
## total_credit_limit + total_debit_limit + verified_income +
## grade + term_cat + disbursement_method + paid_interest
##
## Df Sum of Sq RSS AIC
## - term_cat 1 10 9223 1301.1
## <none> 9213 1301.7
## - verified_income 2 24 9237 1303.8
## - inquiries_last_12m 1 14 9227 1304.7
## - total_credit_limit 1 21 9235 1310.9
## - accounts_opened_24m 1 27 9240 1315.9
## - debt_to_income 1 30 9243 1318.0
## - disbursement_method 1 52 9265 1337.4
## - paid_interest 1 113 9326 1389.9
## - total_debit_limit 1 117 9331 1393.5
## - grade 6 102375 111588 21154.8
##
## Step: AIC=1301.12
## interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m +
## total_credit_limit + total_debit_limit + verified_income +
## grade + disbursement_method + paid_interest
##
## Df Sum of Sq RSS AIC
## <none> 9223 1301.1
## - verified_income 2 23 9247 1303.0
## - inquiries_last_12m 1 13 9237 1303.5
## - total_credit_limit 1 19 9242 1308.5
## - accounts_opened_24m 1 26 9250 1314.6
## - debt_to_income 1 28 9251 1316.1
## - disbursement_method 1 49 9273 1334.5
## - total_debit_limit 1 114 9337 1389.8
## - paid_interest 1 153 9376 1423.2
## - grade 6 106288 115511 21421.4
##
## Call:
## lm(formula = interest_rate ~ inquiries_last_12m + debt_to_income +
## accounts_opened_24m + total_credit_limit + total_debit_limit +
## verified_income + grade + disbursement_method + paid_interest,
## data = train)
##
## Coefficients:
## (Intercept) inquiries_last_12m
## 6.736e+00 1.946e-02
## debt_to_income accounts_opened_24m
## 3.952e-03 2.057e-02
## total_credit_limit total_debit_limit
## -2.923e-07 -5.162e-06
## verified_incomeSource Verified verified_incomeVerified
## 8.741e-02 1.469e-01
## gradeB gradeC
## 3.579e+00 7.112e+00
## gradeD gradeE
## 1.187e+01 1.773e+01
## gradeF gradeG
## 2.187e+01 2.332e+01
## disbursement_methodDirectPay paid_interest
## -3.112e-01 3.298e-04
The BIC method removes the TERM from our model, but based on the result from previous boxplots, we do observe that term has some impacts on the interest, so we decide to keep it in our model.
Some very important assumption on a linear regression model are:
We first check model assumptions by plotting residuals vs fitted, normal Q-Q plot, scale-location, and residuals vs leverage. We see that apart from one outlier, the residuals are normal and constant, and there is no autocorrelation. Our model does not violate any model assumptions.
## Checking model assumptions
par(mfrow = c(2, 2))
plot(model)
We visualized the predicted result below. As we can see, the predicted results align with the real data well, suggesting that our model is a good fit in predicting the interest rate.
## Visualize the predicted result
pred <- predict(model, newdata = test, interval = "confidence")
pred_fit <- pred[,1]
plot(test$interest_rate, pred_fit, xlab = "Test Data Interest Rate", ylab = "Predicted Interest Rate")
If I have more time, I will try adding interaction terms into our linear regression model to further see if there are any interaction terms that are statistically significant or help to lower adjusted R-squared.
This time, we use the generalized least square method, and an interaction term total_credit_limit*total_debit_limit to predict the interest rate.
model2 <- glm(interest_rate ~ inquiries_last_12m + debt_to_income + accounts_opened_24m + total_credit_limit + total_debit_limit + verified_income + grade + term_cat + disbursement_method + paid_interest + total_credit_limit*total_debit_limit, data = train)
summary(model2)
##
## Call:
## glm(formula = interest_rate ~ inquiries_last_12m + debt_to_income +
## accounts_opened_24m + total_credit_limit + total_debit_limit +
## verified_income + grade + term_cat + disbursement_method +
## paid_interest + total_credit_limit * total_debit_limit, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.1661 -0.7718 -0.1044 0.7964 3.2091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.827e+00 4.055e-02 168.345 < 2e-16 ***
## inquiries_last_12m 1.952e-02 5.698e-03 3.425 0.000617 ***
## debt_to_income 4.422e-03 8.063e-04 5.485 4.26e-08 ***
## accounts_opened_24m 2.291e-02 4.331e-03 5.291 1.25e-07 ***
## total_credit_limit -6.640e-07 9.717e-08 -6.833 8.90e-12 ***
## total_debit_limit -8.292e-06 7.648e-07 -10.842 < 2e-16 ***
## verified_incomeSource Verified 8.205e-02 2.815e-02 2.915 0.003568 **
## verified_incomeVerified 1.371e-01 3.378e-02 4.058 5.00e-05 ***
## gradeB 3.544e+00 3.494e-02 101.436 < 2e-16 ***
## gradeC 7.061e+00 3.940e-02 179.205 < 2e-16 ***
## gradeD 1.181e+01 4.866e-02 242.767 < 2e-16 ***
## gradeE 1.766e+01 8.005e-02 220.570 < 2e-16 ***
## gradeF 2.179e+01 1.655e-01 131.603 < 2e-16 ***
## gradeG 2.327e+01 4.100e-01 56.759 < 2e-16 ***
## term_cat60 1.033e-01 3.067e-02 3.367 0.000763 ***
## disbursement_methodDirectPay -3.188e-01 4.766e-02 -6.688 2.41e-11 ***
## paid_interest 3.146e-04 3.043e-05 10.336 < 2e-16 ***
## total_credit_limit:total_debit_limit 8.820e-12 1.626e-12 5.425 5.96e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.15261)
##
## Null deviance: 196211.5 on 7981 degrees of freedom
## Residual deviance: 9179.4 on 7964 degrees of freedom
## (18 observations deleted due to missingness)
## AIC: 23806
##
## Number of Fisher Scoring iterations: 2
Again, the GLM model has the same model assumption as the LM model.
Simiarly, we first check model assumptions by plotting residuals vs fitted, normal Q-Q plot, scale-location, and residuals vs leverage. We see that apart from one outlier, the residuals are normal and constant, and there is no autocorrelation. Our model does not violate any model assumptions.
## Checking model assumptions
par(mfrow = c(2, 2))
plot(model2)
We visualized the predicted result below. As we can see, the predicted results align with the real test data well, suggesting that our GLM model with interaction terms is a good fit in predicting the interest rate.
## Visualize the predicted result
pred2 <- predict(model2, newdata = test, interval = "confidence")
plot(test$interest_rate, pred2, xlab = "Test Data Interest Rate", ylab = "Predicted Interest Rate")
For this case study, I used the sqldf package in r and wrote SQL queries.
df <- read.csv("casestudy.csv")
head(df)
## X customer_email net_revenue year
## 1 0 nhknapwsbx@gmail.com 249.92 2015
## 2 1 joiuzbvcpn@gmail.com 87.61 2015
## 3 2 ukkjctepxt@gmail.com 168.38 2015
## 4 3 gykatilzrt@gmail.com 62.40 2015
## 5 4 mmsgsrtxah@gmail.com 43.08 2015
## 6 5 mobvusnzfr@gmail.com 39.44 2015
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 0x0006): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Versions/4.2/Resources/modules/R_X11.so
## Reason: tried: '/opt/X11/lib/libSM.6.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libSM.6.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libSM.6.dylib' (no such file)
## tcltk DLL is linked to '/opt/X11/lib/libX11.6.dylib'
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
total_revenue <- sqldf('select distinct year as Year, sum(net_revenue) over (partition by year) as "Total_Revenue" from df
order by year desc')
total_revenue
## Year Total_Revenue
## 1 2017 31417495
## 2 2016 25730944
## 3 2015 29036749
sqldf('select year as Year, sum(net_revenue) as "New Customers Total Revenue" from df
where year = 2017 and customer_email not in (select customer_email from df where year = 2016)
union all
select year as Year, sum(net_revenue) as "New Customers Total Revenue" from df
where year = 2016 and customer_email not in (select customer_email from df where year = 2015)')
## Year New Customers Total Revenue
## 1 2017 28776235
## 2 2016 18245491
Since we don’t have the data for 2014, we don’t know the changes in customers from 2014 to 2015 and hence don’t know the existing customers of 2015. In this case, we only calculate the revenue for existing customers in 2017 and 2016 and then subtract the revenue of existing customers in 2016 from the revenue of existing customers in 2017 to obtain existing customer growth, which is -4,844,193.
existing_cus_rev <- sqldf('select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
where year = 2017 and customer_email in (select customer_email from df where year = 2016)
union all
select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
where year = 2016 and customer_email in (select customer_email from df where year = 2015)')
existing_cus_rev
## Year Existing Customers Total Revenue
## 1 2017 2641260
## 2 2016 7485453
existing_cus_growth <- 2641260 - 7485453
existing_cus_growth
## [1] -4844193
## Revenue of Customers who are presented in 2016 but not in 2017
rev_loss_attrition_2017 <- sqldf('select sum(net_revenue) as "Revenue Lost from Attrition from 2016 to 2017" from df
where year = 2016 and customer_email not in (select customer_email from df where year = 2017)')
rev_loss_attrition_2017
## Revenue Lost from Attrition from 2016 to 2017
## 1 23110295
## Revenue of Customers who are presented in 2015 but not in 2016
rev_loss_attrition_2016 <- sqldf('select sum(net_revenue) as "Revenue Lost from Attrition from 2015 to 2016" from df
where year = 2015 and customer_email not in (select customer_email from df where year = 2016)')
rev_loss_attrition_2016
## Revenue Lost from Attrition from 2015 to 2016
## 1 21571632
existing_cus_rev <- sqldf('select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
where year = 2017 and customer_email in (select customer_email from df where year = 2016)
union all
select year as Year, sum(net_revenue) as "Existing Customers Total Revenue" from df
where year = 2016 and customer_email in (select customer_email from df where year = 2015)')
existing_cus_rev
## Year Existing Customers Total Revenue
## 1 2017 2641260
## 2 2016 7485453
sqldf('select Year, lead("Existing Customers Total Revenue", 1) over(order by Year desc) as "Existing Customer Revenue Prior Year" from existing_cus_rev')
## Year Existing Customer Revenue Prior Year
## 1 2017 7485453
## 2 2016 NA
total_cus <- sqldf('select distinct year as Year, count(customer_email) over(partition by year) as "Total_Customers_Current_Year" from df
order by year desc')
total_cus
## Year Total_Customers_Current_Year
## 1 2017 249987
## 2 2016 204646
## 3 2015 231294
The “Total Customers Previous Year” of 2017 is the total customers of 2016.
The “Total Customers Previous Year” of 2016 is the total customers of 2015.
So I use the lead window function in SQL.
sqldf('with cte as (select distinct year as Year, count(customer_email) over(partition by year) as "Total Customers Current Year" from df
order by year desc)
select Year, lead("Total Customers Current Year", 1) over(order by Year desc) as "Total Customers Previous Year" from cte')
## Year Total Customers Previous Year
## 1 2017 204646
## 2 2016 231294
## 3 2015 NA
For this question to save space, I only show 5 new customers each year using “limit 5”.
## New Customers of 2017
sqldf('select customer_email as "New Customers of 2017" from df
where year = 2017 and customer_email not in (select customer_email from df where year = 2016)
limit 5')
## New Customers of 2017
## 1 mwrossuukz@gmail.com
## 2 qvjfterwnk@gmail.com
## 3 vlyigtgfzs@gmail.com
## 4 yfirychuhk@gmail.com
## 5 fdkiioqtli@gmail.com
## New Customers of 2016
sqldf('select customer_email as "New Customers of 2016" from df
where year = 2016 and customer_email not in (select customer_email from df where year = 2015)
limit 5')
## New Customers of 2016
## 1 mwrossuukz@gmail.com
## 2 gkwsoupawk@gmail.com
## 3 vlyigtgfzs@gmail.com
## 4 yfirychuhk@gmail.com
## 5 trnzgqinuu@gmail.com
For this question, to save space, I only show 5 lost customers each year using “limit 5”.
## Customers who are presented in 2016 but not in 2017 are Lost Customers from 2016 to 2017
sqldf('select customer_email from df
where year = 2016 and customer_email not in (select customer_email from df where year = 2017)
limit 5')
## customer_email
## 1 mwrossuukz@gmail.com
## 2 gkwsoupawk@gmail.com
## 3 vlyigtgfzs@gmail.com
## 4 yfirychuhk@gmail.com
## 5 trnzgqinuu@gmail.com
## Customers who are presented in 2015 but not in 2016 are Lost Customers from 2015 to 2016
sqldf('select customer_email from df
where year = 2015 and customer_email not in (select customer_email from df where year = 2016)
limit 5')
## customer_email
## 1 nhknapwsbx@gmail.com
## 2 joiuzbvcpn@gmail.com
## 3 ukkjctepxt@gmail.com
## 4 gykatilzrt@gmail.com
## 5 mmsgsrtxah@gmail.com
ggplot(total_revenue, aes(x = Year, y = Total_Revenue)) +
theme1 +
geom_line() +
ylim(20000000, 40000000) +
xlab("Year") +
ylab("Total Revenue") +
ggtitle("Total Revenue Each Year") +
theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))
df$year = as.character(df$year)
ggplot(df, aes(x = year)) +
theme1 +
geom_bar() +
xlab("Year") +
ylab("Customer Count") +
ggtitle("Number of Total Customers Each Year") +
theme(axis.text.x = element_text(size = 8, angle = 0, hjust = 0.5))