Rmd Assignment - Week 5 - Lonie Moore

Data Overview

Raw Data

Data Sources

Import the data from the following url: https://cheetahanalytics.com/loandata

Data Summary

## Skim summary statistics
##  n obs: 10000 
##  n variables: 84 
## 
## -- Variable type:character ------------------------------------------------------------------------------------------------------------------------------------
##             variable missing complete     n min max empty n_unique
##           addr_state       0    10000 10000   2   2     0       50
##     application_type       0    10000 10000   9  10     0        2
##  disbursement_method       0    10000 10000   4   9     0        2
##     earliest_cr_line       0    10000 10000   6   6     0      510
##           emp_length       0    10000 10000   3   9     0       12
##                grade       0    10000 10000   1   1     0        7
##        hardship_flag       0    10000 10000   1   1     0        1
##       home_ownership       0    10000 10000   3   8     0        4
##  initial_list_status       0    10000 10000   1   1     0        2
##              issue_d       0    10000 10000   6   6     0        1
##          loan_status       0    10000 10000   7  18     0        6
##              purpose       0    10000 10000   3  18     0       12
##           pymnt_plan       0    10000 10000   1   1     0        1
##            sub_grade       0    10000 10000   2   2     0       29
##                 term       0    10000 10000   9   9     0        2
##                title       0    10000 10000   5  23     0       12
##  verification_status       0    10000 10000   8  15     0        3
##             zip_code       0    10000 10000   5   5     0      788
## 
## -- Variable type:numeric --------------------------------------------------------------------------------------------------------------------------------------
##                    variable missing complete     n       mean         sd
##              acc_now_delinq       0    10000 10000      0          0    
##        acc_open_past_24mths       0    10000 10000      4.44       3.21 
##                  annual_inc       0    10000 10000  85879.56   81465.7  
##    chargeoff_within_12_mths       0    10000 10000      0.007      0.094
##     collection_recovery_fee       0    10000 10000      0          0    
##  collections_12_mths_ex_med       0    10000 10000      0.018      0.15 
##                 delinq_2yrs       0    10000 10000      0.22       0.7  
##                 delinq_amnt       0    10000 10000      0.21      20.77 
##                 funded_amnt       0    10000 10000  16152.3    10365.72 
##             funded_amnt_inv       0    10000 10000  16148.55   10368.46 
##                      inq_fi       0    10000 10000      1.17       1.52 
##                inq_last_12m       0    10000 10000      1.94       2.28 
##              inq_last_6mths       0    10000 10000      0.51       0.8  
##                 installment       0    10000 10000    466.1      289.94 
##                    int_rate       0    10000 10000     12.68       4.82 
##             last_pymnt_amnt       0    10000 10000    721.48    2376.85 
##                   loan_amnt       0    10000 10000  16152.3    10365.72 
##                  max_bal_bc       0    10000 10000   6261.49    6714.87 
##        mo_sin_old_rev_tl_op       0    10000 10000    173.81      98.1  
##       mo_sin_rcnt_rev_tl_op       0    10000 10000     15.27      19.03 
##              mo_sin_rcnt_tl       0    10000 10000      8.09       9.14 
##                    mort_acc       0    10000 10000      1.32       1.67 
##       num_accts_ever_120_pd       0    10000 10000      0.43       1.31 
##              num_actv_bc_tl       0    10000 10000      3.69       2.44 
##             num_actv_rev_tl       0    10000 10000      5.44       3.41 
##                 num_bc_sats       0    10000 10000      4.96       3.23 
##                   num_bc_tl       0    10000 10000      7.15       4.57 
##                   num_il_tl       0    10000 10000      8.54       7.34 
##               num_op_rev_tl       0    10000 10000      8.23       4.9  
##               num_rev_accts       0    10000 10000     12.94       7.89 
##         num_rev_tl_bal_gt_0       0    10000 10000      5.42       3.37 
##                    num_sats       0    10000 10000     11.74       5.98 
##                num_tl_30dpd       0    10000 10000      0          0    
##          num_tl_90g_dpd_24m       0    10000 10000      0.055      0.36 
##          num_tl_op_past_12m       0    10000 10000      2.1        1.94 
##                    open_acc       0    10000 10000     11.76       5.98 
##                 open_acc_6m       0    10000 10000      1.02       1.2  
##                 open_act_il       0    10000 10000      2.91       3.03 
##                 open_il_12m       0    10000 10000      0.76       0.97 
##                 open_il_24m       0    10000 10000      1.65       1.57 
##                 open_rv_12m       0    10000 10000      1.21       1.51 
##                 open_rv_24m       0    10000 10000      2.52       2.52 
##                   out_prncp       0    10000 10000  15268.86   10169.62 
##               out_prncp_inv       0    10000 10000  15265.43   10171.98 
##              pct_tl_nvr_dlq       0    10000 10000     94.88       8.68 
##                 policy_code       0    10000 10000      1          0    
##                     pub_rec       0    10000 10000      0.12       0.33 
##        pub_rec_bankruptcies       0    10000 10000      0.12       0.33 
##                  recoveries       0    10000 10000      0          0    
##                   revol_bal       0    10000 10000  17266.16   24886.44 
##                   tax_liens       0    10000 10000      0          0    
##                tot_coll_amt       0    10000 10000    197.55    2630.98 
##                 tot_cur_bal       0    10000 10000 153290.75  172563.17 
##             tot_hi_cred_lim       0    10000 10000  2e+05      2e+05    
##                   total_acc       0    10000 10000     23.01      12.15 
##           total_bal_ex_mort       0    10000 10000  56828.64   57630.35 
##                total_bal_il       0    10000 10000  39039.59   48709.43 
##              total_bc_limit       0    10000 10000  29105.17   27509.62 
##                 total_cu_tl       0    10000 10000      1.5        2.61 
##  total_il_high_credit_limit       0    10000 10000  49930.21   50802.29 
##                 total_pymnt       0    10000 10000   1179.84    2526.32 
##             total_pymnt_inv       0    10000 10000   1179.46    2526.1  
##               total_rec_int       0    10000 10000    297.17     232.81 
##          total_rec_late_fee       0    10000 10000      0.035      1    
##             total_rec_prncp       0    10000 10000    882.64    2508.79 
##            total_rev_hi_lim       0    10000 10000  41355.25   41156.37 
##       p0      p25       p50       p75       p100     hist
##     0        0         0         0          0    <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
##     0        2         4         6         31    <U+2587><U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0    50000     70000     1e+05    5119032    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          4    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          0    <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          5    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0         16    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0       2077    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##  1000     8000     14000     22200      40000    <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
##   750     8000     14000     22200      40000    <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
##     0        0         1         2         17    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         1         3         51    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         1          5    <U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##    30.64   248.4     389.12    628.49    1597.6  <U+2585><U+2587><U+2585><U+2583><U+2582><U+2581><U+2581><U+2581>
##     6        8.19     11.8      15.02      30.84 <U+2587><U+2586><U+2585><U+2583><U+2582><U+2581><U+2581><U+2581>
##     0      253.7     399.56    654.75   40910.48 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##  1000     8000     14000     22200      40000    <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
##     0     2342.75   4689.5    8329.5   306125    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     8      100       157       225        691    <U+2585><U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581>
##     0        4         9        20        217    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        3         5        11        183    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         1         2         14    <U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0         24    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        2         3         5         23    <U+2586><U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        3         5         7         32    <U+2587><U+2586><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        3         4         6         36    <U+2587><U+2586><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        4         6         9         54    <U+2587><U+2586><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        3         7        11         59    <U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        5         7        10         41    <U+2586><U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
##     2        7        11        16.25      81    <U+2587><U+2586><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        3         5         7         32    <U+2587><U+2586><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        8        11        15         59    <U+2585><U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          0    <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0         16    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        1         2         3         24    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        8        11        15         59    <U+2585><U+2587><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         1         2         13    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        1         2         4         32    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         1          6    <U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        1         1         2         19    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         1         2         22    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        1         2         4         28    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0     7598.25  13127.7   21312.2    39662.62 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
##     0     7598.25  13119.87  21312.2    39662.62 <U+2585><U+2587><U+2586><U+2585><U+2583><U+2582><U+2582><U+2582>
##    30.8     92.7     100       100        100    <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
##     1        1         1         1          1    <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          3    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          3    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          0    <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
##     0     5754.5   11404     20933      1e+06    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0          0    <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0     208593    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0    29630.25  81663    232323    2163151    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0    56942.5  126331.5  284325      3e+06    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     2       14        21        30         95    <U+2585><U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581>
##     0    21243.25  41046     72852    1077617    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0     9164     25111     50565.5   590007    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0    10900     21000     38400     417400    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         2         35    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0    16349     37231     67178.25  544010    <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0      475.58    752.27   1241.56   40856.68 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0      475.57    752.27   1241.27   40856.68 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0      121.11    230.87    413.03    1786.05 <U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0        0         0         0         52.99 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0      295.88    483.33    789.23   40000    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##     0    18100     31000     52500    1455000    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>

Visualizations

Visual1

I use a visualization to demonstrate that interest rates differ significantly by the loan term length. This visualization justifies the need to split out the 36 month loans from the 60 month loans. I’ll then contrast this with a couple other visualizations that a person could have tried to use.

BOXPLOTS: I like the boxplots in this case because it shows the distributions side-by-side on the same graph, which really makes it easy to compare the quartiles between term lengths. From the boxplot, a person can see that the 25th percentile, median, and 75th-percentile approved interest rates are significantly lower for the 36 month term loans.

HISTOGRAMS: The histograms are useful to visualize the general shape of the distributions of approved interest rates by length of term. However, these histograms consist of counts and so they’re not the best for the sake of direct comparison of two distributions other than a quick and dirty eye-ball test to see if this is worth looking into further. A better comparison would show percentages instead of raw counts.

DENSITY CHART: The side-by-side density charts have an advantage over the histograms in that the counts have been converted to something that’s more-readily comparable. However, it’s still difficult to tell from the density charts where the exact percentiles of the curve would fall. So again, the density charts are probably not the best for direct comparison, and I still prefer the box plots in this case.

One may find it even easier to compare these two density plots if we overlayed them on the same graph.

Visual2

For this visualization I’m using another box plot, but this time I’m showing approved interest rates by the derived field “high risk” which is defined as having 5 or more delinquencies in the past 2 years. The mutation function from dplyer is used to create the derived field “high risk”.

#BOXPLOT The boxplots show that customers who fall into the “high risk” category have significantly higher interest rates than customers who fall into the “low risk” category.

SCATTERPLOT with JITTER: The following scatterplot reveals that “grade” can be used to identify different interest rate tiers. You can see that the “high risk” customers appear in each of the “grades” A:E, except for Grade B.

However, it’s interesting to note that the interest rates do not appear to differ substantially within each “grade” whether a person falls into the “high risk” category or not (except for the Grade A category. This observation suggests that there are other significant factors driving which “grade” a customer falls into other than the “high risk” variable that I derived.

Visual3

For this visualization I’m creating a new variable called “high income”, then grouping by that derived variable along with the “grade” variable. Then I filter out some of the lower volume categories such as customers who did not specify a type of home ownership (i.e. home_ownership = “ANY”) as well as the grade = “F” and grade = “G” groups which are very low volume (this will make the visuals a bit easier to consume).

#SCATTERPLOTS USING FACETS: Use faceting to show an Interest Rate versus Loan Amount scatterplot for a crosstabulation of Homeownership versus grade. This is an interesting visual because it shows (very loosely) what could be an interaction effect on approved loan amounts between grade and high income. For example, in the Grade A column, high income customers are associated with higher approved loan amounts than low income customers. However, in the Grade D and Grade E categories, the high income customers do not appear to be getting approved for higher loan amounts than low income customers.

## # A tibble: 9,976 x 7
##    loan_amnt int_rate annual_inc home_ownership purpose      grade high_inc
##        <dbl>    <dbl>      <dbl> <chr>          <chr>        <chr> <lgl>   
##  1      2500     13.6      55000 RENT           debt_consol~ C     FALSE   
##  2     30000     18.9      90000 MORTGAGE       debt_consol~ D     FALSE   
##  3      5000     18.0      59280 MORTGAGE       debt_consol~ D     FALSE   
##  4      4000     18.9      92000 MORTGAGE       debt_consol~ D     FALSE   
##  5     30000     16.1      57250 MORTGAGE       debt_consol~ C     FALSE   
##  6      5550     15.0     152500 MORTGAGE       credit_card  C     TRUE    
##  7      2000     18.0      51000 RENT           debt_consol~ D     FALSE   
##  8      6000     13.6      65000 RENT           credit_card  C     FALSE   
##  9      5000     18.0      53580 MORTGAGE       debt_consol~ D     FALSE   
## 10      6000     14.5     300000 OWN            debt_consol~ C     TRUE    
## # ... with 9,966 more rows

Removing the home ownership variable from the faceting makes this a little bit easier to see.

Visual4

For this visualization I want to focus on just the Grade A category (lowest approved rates) and customers who rent their homes. For this subset, I want to investigate whether or not the type of loan requested is an important variable when viewed along with the high income field derived above.

The side-by-side boxplots reveal a couple interesting observations for this subset of customers who rent their homes. First, it’s interesting to note that interest rates do not vary by income class wihtin the credit card and debt loan categories, where the boxplots appear to be nearly identical. Second, there are drastic differences in approved rates within the “medical” and “moving” loan categories.

#BAR GRAPH With the drastic differences we’re seeing in the “medical” and “moving categories”, it might be a good idea to go back and check the sample sizes within these groups to make sure we’re dealing with a large enough data set. Further inspection reveals that the vast majority of our renters are applying for credit cards and debt consolidation. With this in mind, we should probably narrow our investigation to these loan types if we wish to derive further insights that are relevant to this particular subset. If this were a business that we were already familiar with, we might have already had this intuition. When dealing with unfamiliar data as a data scientist, we should never be afraid to go “back to the basics” and look at the count data and basic summary statistics before moving on to more-advanced techniques.

Visual5

The following visualization compares average revolving balances to average loan amounts, grouped by the high income variable.

SCATTERPLOT OF AVERAGE LOAN AMOUNT VERSUS AVERAGE REVOLVING BALANCE: The following scatterplot shows that high income customers tend to be “clustered” in the higher average revolving balance + high average loan amount area of the chart, while low income customers tend to be "clustered in the low average revolving balances + low average loan amount area of the chart.

Lonie Moore

2019-09-24