Rmd Assignment - Week 5 - Lonie Moore
Introduction
The purpose of this markdown is to publish 5 modified subsets from last week’s assignment inside a .rmd.
Required Packages
The packages required for this markdown are:
library(tidyverse) #the tidyverse collection of packages all together
library(DT) #makes interactive javascript data tables
library(skimr) #has a useful "skim" function for quick summary data
library(dplyr) #also in tidyverse; allows for easy data manipulation in R (filter, select, mutate, group_by, etc)
library(ggplot2) #makes graphsData Overview
Raw Data
Data Sources
Import the data from the following url: https://cheetahanalytics.com/loandata
Data Summary
## Skim summary statistics
## n obs: 10000
## n variables: 84
##
## -- Variable type:character ------------------------------------------------------------------------------------------------------------------------------------
## variable missing complete n min max empty n_unique
## addr_state 0 10000 10000 2 2 0 50
## application_type 0 10000 10000 9 10 0 2
## disbursement_method 0 10000 10000 4 9 0 2
## earliest_cr_line 0 10000 10000 6 6 0 510
## emp_length 0 10000 10000 3 9 0 12
## grade 0 10000 10000 1 1 0 7
## hardship_flag 0 10000 10000 1 1 0 1
## home_ownership 0 10000 10000 3 8 0 4
## initial_list_status 0 10000 10000 1 1 0 2
## issue_d 0 10000 10000 6 6 0 1
## loan_status 0 10000 10000 7 18 0 6
## purpose 0 10000 10000 3 18 0 12
## pymnt_plan 0 10000 10000 1 1 0 1
## sub_grade 0 10000 10000 2 2 0 29
## term 0 10000 10000 9 9 0 2
## title 0 10000 10000 5 23 0 12
## verification_status 0 10000 10000 8 15 0 3
## zip_code 0 10000 10000 5 5 0 788
##
## -- Variable type:numeric --------------------------------------------------------------------------------------------------------------------------------------
## variable missing complete n mean sd
## acc_now_delinq 0 10000 10000 0 0
## acc_open_past_24mths 0 10000 10000 4.44 3.21
## annual_inc 0 10000 10000 85879.56 81465.7
## chargeoff_within_12_mths 0 10000 10000 0.007 0.094
## collection_recovery_fee 0 10000 10000 0 0
## collections_12_mths_ex_med 0 10000 10000 0.018 0.15
## delinq_2yrs 0 10000 10000 0.22 0.7
## delinq_amnt 0 10000 10000 0.21 20.77
## funded_amnt 0 10000 10000 16152.3 10365.72
## funded_amnt_inv 0 10000 10000 16148.55 10368.46
## inq_fi 0 10000 10000 1.17 1.52
## inq_last_12m 0 10000 10000 1.94 2.28
## inq_last_6mths 0 10000 10000 0.51 0.8
## installment 0 10000 10000 466.1 289.94
## int_rate 0 10000 10000 12.68 4.82
## last_pymnt_amnt 0 10000 10000 721.48 2376.85
## loan_amnt 0 10000 10000 16152.3 10365.72
## max_bal_bc 0 10000 10000 6261.49 6714.87
## mo_sin_old_rev_tl_op 0 10000 10000 173.81 98.1
## mo_sin_rcnt_rev_tl_op 0 10000 10000 15.27 19.03
## mo_sin_rcnt_tl 0 10000 10000 8.09 9.14
## mort_acc 0 10000 10000 1.32 1.67
## num_accts_ever_120_pd 0 10000 10000 0.43 1.31
## num_actv_bc_tl 0 10000 10000 3.69 2.44
## num_actv_rev_tl 0 10000 10000 5.44 3.41
## num_bc_sats 0 10000 10000 4.96 3.23
## num_bc_tl 0 10000 10000 7.15 4.57
## num_il_tl 0 10000 10000 8.54 7.34
## num_op_rev_tl 0 10000 10000 8.23 4.9
## num_rev_accts 0 10000 10000 12.94 7.89
## num_rev_tl_bal_gt_0 0 10000 10000 5.42 3.37
## num_sats 0 10000 10000 11.74 5.98
## num_tl_30dpd 0 10000 10000 0 0
## num_tl_90g_dpd_24m 0 10000 10000 0.055 0.36
## num_tl_op_past_12m 0 10000 10000 2.1 1.94
## open_acc 0 10000 10000 11.76 5.98
## open_acc_6m 0 10000 10000 1.02 1.2
## open_act_il 0 10000 10000 2.91 3.03
## open_il_12m 0 10000 10000 0.76 0.97
## open_il_24m 0 10000 10000 1.65 1.57
## open_rv_12m 0 10000 10000 1.21 1.51
## open_rv_24m 0 10000 10000 2.52 2.52
## out_prncp 0 10000 10000 15268.86 10169.62
## out_prncp_inv 0 10000 10000 15265.43 10171.98
## pct_tl_nvr_dlq 0 10000 10000 94.88 8.68
## policy_code 0 10000 10000 1 0
## pub_rec 0 10000 10000 0.12 0.33
## pub_rec_bankruptcies 0 10000 10000 0.12 0.33
## recoveries 0 10000 10000 0 0
## revol_bal 0 10000 10000 17266.16 24886.44
## tax_liens 0 10000 10000 0 0
## tot_coll_amt 0 10000 10000 197.55 2630.98
## tot_cur_bal 0 10000 10000 153290.75 172563.17
## tot_hi_cred_lim 0 10000 10000 2e+05 2e+05
## total_acc 0 10000 10000 23.01 12.15
## total_bal_ex_mort 0 10000 10000 56828.64 57630.35
## total_bal_il 0 10000 10000 39039.59 48709.43
## total_bc_limit 0 10000 10000 29105.17 27509.62
## total_cu_tl 0 10000 10000 1.5 2.61
## total_il_high_credit_limit 0 10000 10000 49930.21 50802.29
## total_pymnt 0 10000 10000 1179.84 2526.32
## total_pymnt_inv 0 10000 10000 1179.46 2526.1
## total_rec_int 0 10000 10000 297.17 232.81
## total_rec_late_fee 0 10000 10000 0.035 1
## total_rec_prncp 0 10000 10000 882.64 2508.79
## total_rev_hi_lim 0 10000 10000 41355.25 41156.37
## p0 p25 p50 p75 p100 hist
## 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 2 4 6 31 <U+2587><U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 50000 70000 1e+05 5119032 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 4 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 5 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 16 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 2077 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 1000 8000 14000 22200 40000 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
## 750 8000 14000 22200 40000 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
## 0 0 1 2 17 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 1 3 51 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 1 5 <U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 30.64 248.4 389.12 628.49 1597.6 <U+2585><U+2587><U+2585><U+2583><U+2582><U+2581><U+2581><U+2581>
## 6 8.19 11.8 15.02 30.84 <U+2587><U+2586><U+2585><U+2583><U+2582><U+2581><U+2581><U+2581>
## 0 253.7 399.56 654.75 40910.48 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 1000 8000 14000 22200 40000 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
## 0 2342.75 4689.5 8329.5 306125 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 8 100 157 225 691 <U+2585><U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581>
## 0 4 9 20 217 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 3 5 11 183 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 1 2 14 <U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 24 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 2 3 5 23 <U+2586><U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 3 5 7 32 <U+2587><U+2586><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 3 4 6 36 <U+2587><U+2586><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 4 6 9 54 <U+2587><U+2586><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 3 7 11 59 <U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 5 7 10 41 <U+2586><U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
## 2 7 11 16.25 81 <U+2587><U+2586><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 3 5 7 32 <U+2587><U+2586><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 8 11 15 59 <U+2585><U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 16 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 1 2 3 24 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 8 11 15 59 <U+2585><U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 1 2 13 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 1 2 4 32 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 1 6 <U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 1 1 2 19 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 1 2 22 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 1 2 4 28 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 7598.25 13127.7 21312.2 39662.62 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
## 0 7598.25 13119.87 21312.2 39662.62 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
## 30.8 92.7 100 100 100 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## 1 1 1 1 1 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 3 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 3 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 5754.5 11404 20933 1e+06 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 208593 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 29630.25 81663 232323 2163151 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 56942.5 126331.5 284325 3e+06 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 2 14 21 30 95 <U+2585><U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581>
## 0 21243.25 41046 72852 1077617 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 9164 25111 50565.5 590007 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 10900 21000 38400 417400 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 2 35 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 16349 37231 67178.25 544010 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 475.58 752.27 1241.56 40856.68 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 475.57 752.27 1241.27 40856.68 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 121.11 230.87 413.03 1786.05 <U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 0 0 0 52.99 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 295.88 483.33 789.23 40000 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 0 18100 31000 52500 1455000 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
Visualizations
Visual1
I use a visualization to demonstrate that interest rates differ significantly by the loan term length. This visualization justifies the need to split out the 36 month loans from the 60 month loans. I’ll then contrast this with a couple other visualizations that a person could have tried to use.
BOXPLOTS: I like the boxplots in this case because it shows the distributions side-by-side on the same graph, which really makes it easy to compare the quartiles between term lengths. From the boxplot, a person can see that the 25th percentile, median, and 75th-percentile approved interest rates are significantly lower for the 36 month term loans.
HISTOGRAMS: The histograms are useful to visualize the general shape of the distributions of approved interest rates by length of term. However, these histograms consist of counts and so they’re not the best for the sake of direct comparison of two distributions other than a quick and dirty eye-ball test to see if this is worth looking into further. A better comparison would show percentages instead of raw counts.
DENSITY CHART: The side-by-side density charts have an advantage over the histograms in that the counts have been converted to something that’s more-readily comparable. However, it’s still difficult to tell from the density charts where the exact percentiles of the curve would fall. So again, the density charts are probably not the best for direct comparison, and I still prefer the box plots in this case.
One may find it even easier to compare these two density plots if we overlayed them on the same graph.
Visual2
For this visualization I’m using another box plot, but this time I’m showing approved interest rates by the derived field “high risk” which is defined as having 5 or more delinquencies in the past 2 years. The mutation function from dplyer is used to create the derived field “high risk”.
#BOXPLOT The boxplots show that customers who fall into the “high risk” category have significantly higher interest rates than customers who fall into the “low risk” category.
SCATTERPLOT with JITTER: The following scatterplot reveals that “grade” can be used to identify different interest rate tiers. You can see that the “high risk” customers appear in each of the “grades” A:E, except for Grade B.
However, it’s interesting to note that the interest rates do not appear to differ substantially within each “grade” whether a person falls into the “high risk” category or not (except for the Grade A category. This observation suggests that there are other significant factors driving which “grade” a customer falls into other than the “high risk” variable that I derived.
Visual3
For this visualization I’m creating a new variable called “high income”, then grouping by that derived variable along with the “grade” variable. Then I filter out some of the lower volume categories such as customers who did not specify a type of home ownership (i.e. home_ownership = “ANY”) as well as the grade = “F” and grade = “G” groups which are very low volume (this will make the visuals a bit easier to consume).
#SCATTERPLOTS USING FACETS: Use faceting to show an Interest Rate versus Loan Amount scatterplot for a crosstabulation of Homeownership versus grade. This is an interesting visual because it shows (very loosely) what could be an interaction effect on approved loan amounts between grade and high income. For example, in the Grade A column, high income customers are associated with higher approved loan amounts than low income customers. However, in the Grade D and Grade E categories, the high income customers do not appear to be getting approved for higher loan amounts than low income customers.
## # A tibble: 9,976 x 7
## loan_amnt int_rate annual_inc home_ownership purpose grade high_inc
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <lgl>
## 1 2500 13.6 55000 RENT debt_consol~ C FALSE
## 2 30000 18.9 90000 MORTGAGE debt_consol~ D FALSE
## 3 5000 18.0 59280 MORTGAGE debt_consol~ D FALSE
## 4 4000 18.9 92000 MORTGAGE debt_consol~ D FALSE
## 5 30000 16.1 57250 MORTGAGE debt_consol~ C FALSE
## 6 5550 15.0 152500 MORTGAGE credit_card C TRUE
## 7 2000 18.0 51000 RENT debt_consol~ D FALSE
## 8 6000 13.6 65000 RENT credit_card C FALSE
## 9 5000 18.0 53580 MORTGAGE debt_consol~ D FALSE
## 10 6000 14.5 300000 OWN debt_consol~ C TRUE
## # ... with 9,966 more rows
Removing the home ownership variable from the faceting makes this a little bit easier to see.
Visual4
For this visualization I want to focus on just the Grade A category (lowest approved rates) and customers who rent their homes. For this subset, I want to investigate whether or not the type of loan requested is an important variable when viewed along with the high income field derived above.
The side-by-side boxplots reveal a couple interesting observations for this subset of customers who rent their homes. First, it’s interesting to note that interest rates do not vary by income class wihtin the credit card and debt loan categories, where the boxplots appear to be nearly identical. Second, there are drastic differences in approved rates within the “medical” and “moving” loan categories.
#BAR GRAPH With the drastic differences we’re seeing in the “medical” and “moving categories”, it might be a good idea to go back and check the sample sizes within these groups to make sure we’re dealing with a large enough data set. Further inspection reveals that the vast majority of our renters are applying for credit cards and debt consolidation. With this in mind, we should probably narrow our investigation to these loan types if we wish to derive further insights that are relevant to this particular subset. If this were a business that we were already familiar with, we might have already had this intuition. When dealing with unfamiliar data as a data scientist, we should never be afraid to go “back to the basics” and look at the count data and basic summary statistics before moving on to more-advanced techniques.
ggplot(data = loandata_mut8, aes(x = high_inc, color = high_inc)) +
geom_bar() +
facet_grid(~ purpose)Visual5
The following visualization compares average revolving balances to average loan amounts, grouped by the high income variable.
SCATTERPLOT OF AVERAGE LOAN AMOUNT VERSUS AVERAGE REVOLVING BALANCE: The following scatterplot shows that high income customers tend to be “clustered” in the higher average revolving balance + high average loan amount area of the chart, while low income customers tend to be "clustered in the low average revolving balances + low average loan amount area of the chart.