Introduction

Research Question: How can debt to income, home ownership, and annual income predict loan amount?

The dataset I selected “represents thousands of loans made through the Lending Club platform.” This platform allows individuals to lend money to other individuals. This dataset contains a variety of information on the different individuals from home ownership to annual income to debt to income. There is also information about the loan amount they received and the status of their loans as well as a variety of information that relates to their loans. At 55 variables and 10,000 observations, this dataset covers a wide range of information for thousands of individuals.

The variables I will be focusing on is loan_amount, homeownership, annual_income, and debt_to_income. Loan_amount, annual_income, and debt_to_income are all quantitative (or numeric) variables and homeownership is a categorical variable. I will be using homeownership, annual_income, and debt_to_income to predict information about the loan_amount through multiple linear regression because home ownership, annual income, and debt to income may be key factors in determining the loan amount a person receives. Multiple linear regression is more helpful in this situation as I am largely using non-binary variables and my outcome is continuous.

The link to the source: https://www.openintro.org/data/index.php?data=loans_full_schema. This dataset was retrieved from the site OpenIntro.org Data Sets.

1.Import dataset and libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
loans <- read_csv("C:/DATA101/loans_full_schema.csv")
## Rows: 10000 Columns: 55
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): emp_title, state, homeownership, verified_income, verification_inc...
## dbl (42): emp_length, annual_income, debt_to_income, annual_income_joint, de...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.Explore the dataset with EDA functions

str(loans)
## spc_tbl_ [10,000 × 55] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ emp_title                       : chr [1:10000] "global config engineer" "warehouse office clerk" "assembly" "customer service" ...
##  $ emp_length                      : num [1:10000] 3 10 3 1 10 NA 10 10 10 3 ...
##  $ state                           : chr [1:10000] "NJ" "HI" "WI" "PA" ...
##  $ homeownership                   : chr [1:10000] "MORTGAGE" "RENT" "RENT" "RENT" ...
##  $ annual_income                   : num [1:10000] 90000 40000 40000 30000 35000 34000 35000 110000 65000 30000 ...
##  $ verified_income                 : chr [1:10000] "Verified" "Not Verified" "Source Verified" "Not Verified" ...
##  $ debt_to_income                  : num [1:10000] 18.01 5.04 21.15 10.16 57.96 ...
##  $ annual_income_joint             : num [1:10000] NA NA NA NA 57000 NA 155000 NA NA NA ...
##  $ verification_income_joint       : chr [1:10000] NA NA NA NA ...
##  $ debt_to_income_joint            : num [1:10000] NA NA NA NA 37.7 ...
##  $ delinq_2y                       : num [1:10000] 0 0 0 0 0 1 0 1 1 0 ...
##  $ months_since_last_delinq        : num [1:10000] 38 NA 28 NA NA 3 NA 19 18 NA ...
##  $ earliest_credit_line            : num [1:10000] 2001 1996 2006 2007 2008 ...
##  $ inquiries_last_12m              : num [1:10000] 6 1 4 0 7 6 1 1 3 0 ...
##  $ total_credit_lines              : num [1:10000] 28 30 31 4 22 32 12 30 35 9 ...
##  $ open_credit_lines               : num [1:10000] 10 14 10 4 16 12 10 15 21 6 ...
##  $ total_credit_limit              : num [1:10000] 70795 28800 24193 25400 69839 ...
##  $ total_credit_utilized           : num [1:10000] 38767 4321 16000 4997 52722 ...
##  $ num_collections_last_12m        : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ num_historical_failed_to_pay    : num [1:10000] 0 1 0 1 0 0 0 0 0 0 ...
##  $ months_since_90d_late           : num [1:10000] 38 NA 28 NA NA 60 NA 71 18 NA ...
##  $ current_accounts_delinq         : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_collection_amount_ever    : num [1:10000] 1250 0 432 0 0 0 0 0 0 0 ...
##  $ current_installment_accounts    : num [1:10000] 2 0 1 1 1 0 2 2 6 1 ...
##  $ accounts_opened_24m             : num [1:10000] 5 11 13 1 6 2 1 4 10 5 ...
##  $ months_since_last_credit_inquiry: num [1:10000] 5 8 7 15 4 5 9 7 4 17 ...
##  $ num_satisfactory_accounts       : num [1:10000] 10 14 10 4 16 12 10 15 21 6 ...
##  $ num_accounts_120d_past_due      : num [1:10000] 0 0 0 0 0 0 0 NA 0 0 ...
##  $ num_accounts_30d_past_due       : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ num_active_debit_accounts       : num [1:10000] 2 3 3 2 10 1 3 5 11 3 ...
##  $ total_debit_limit               : num [1:10000] 11100 16500 4300 19400 32700 ...
##  $ num_total_cc_accounts           : num [1:10000] 14 24 14 3 20 27 8 16 19 7 ...
##  $ num_open_cc_accounts            : num [1:10000] 8 14 8 3 15 12 7 12 14 5 ...
##  $ num_cc_carrying_balance         : num [1:10000] 6 4 6 2 13 5 6 10 14 3 ...
##  $ num_mort_accounts               : num [1:10000] 1 0 0 0 0 3 2 7 2 0 ...
##  $ account_never_delinq_percent    : num [1:10000] 92.9 100 93.5 100 100 78.1 100 93 97.1 100 ...
##  $ tax_liens                       : num [1:10000] 0 0 0 1 0 0 0 0 0 0 ...
##  $ public_record_bankrupt          : num [1:10000] 0 1 0 0 0 0 0 0 0 0 ...
##  $ loan_purpose                    : chr [1:10000] "moving" "debt_consolidation" "other" "debt_consolidation" ...
##  $ application_type                : chr [1:10000] "individual" "individual" "individual" "individual" ...
##  $ loan_amount                     : num [1:10000] 28000 5000 2000 21600 23000 5000 24000 20000 20000 6400 ...
##  $ term                            : num [1:10000] 60 36 36 36 36 36 60 60 36 36 ...
##  $ interest_rate                   : num [1:10000] 14.07 12.61 17.09 6.72 14.07 ...
##  $ installment                     : num [1:10000] 652.5 167.5 71.4 664.2 786.9 ...
##  $ grade                           : chr [1:10000] "C" "C" "D" "A" ...
##  $ sub_grade                       : chr [1:10000] "C3" "C1" "D1" "A3" ...
##  $ issue_month                     : chr [1:10000] "Mar-2018" "Feb-2018" "Feb-2018" "Jan-2018" ...
##  $ loan_status                     : chr [1:10000] "Current" "Current" "Current" "Current" ...
##  $ initial_listing_status          : chr [1:10000] "whole" "whole" "fractional" "whole" ...
##  $ disbursement_method             : chr [1:10000] "Cash" "Cash" "Cash" "Cash" ...
##  $ balance                         : num [1:10000] 27016 4651 1825 18853 21430 ...
##  $ paid_total                      : num [1:10000] 1999 499 282 3313 2325 ...
##  $ paid_principal                  : num [1:10000] 984 349 175 2747 1570 ...
##  $ paid_interest                   : num [1:10000] 1015 150 106 566 755 ...
##  $ paid_late_fees                  : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   emp_title = col_character(),
##   ..   emp_length = col_double(),
##   ..   state = col_character(),
##   ..   homeownership = col_character(),
##   ..   annual_income = col_double(),
##   ..   verified_income = col_character(),
##   ..   debt_to_income = col_double(),
##   ..   annual_income_joint = col_double(),
##   ..   verification_income_joint = col_character(),
##   ..   debt_to_income_joint = col_double(),
##   ..   delinq_2y = col_double(),
##   ..   months_since_last_delinq = col_double(),
##   ..   earliest_credit_line = col_double(),
##   ..   inquiries_last_12m = col_double(),
##   ..   total_credit_lines = col_double(),
##   ..   open_credit_lines = col_double(),
##   ..   total_credit_limit = col_double(),
##   ..   total_credit_utilized = col_double(),
##   ..   num_collections_last_12m = col_double(),
##   ..   num_historical_failed_to_pay = col_double(),
##   ..   months_since_90d_late = col_double(),
##   ..   current_accounts_delinq = col_double(),
##   ..   total_collection_amount_ever = col_double(),
##   ..   current_installment_accounts = col_double(),
##   ..   accounts_opened_24m = col_double(),
##   ..   months_since_last_credit_inquiry = col_double(),
##   ..   num_satisfactory_accounts = col_double(),
##   ..   num_accounts_120d_past_due = col_double(),
##   ..   num_accounts_30d_past_due = col_double(),
##   ..   num_active_debit_accounts = col_double(),
##   ..   total_debit_limit = col_double(),
##   ..   num_total_cc_accounts = col_double(),
##   ..   num_open_cc_accounts = col_double(),
##   ..   num_cc_carrying_balance = col_double(),
##   ..   num_mort_accounts = col_double(),
##   ..   account_never_delinq_percent = col_double(),
##   ..   tax_liens = col_double(),
##   ..   public_record_bankrupt = col_double(),
##   ..   loan_purpose = col_character(),
##   ..   application_type = col_character(),
##   ..   loan_amount = col_double(),
##   ..   term = col_double(),
##   ..   interest_rate = col_double(),
##   ..   installment = col_double(),
##   ..   grade = col_character(),
##   ..   sub_grade = col_character(),
##   ..   issue_month = col_character(),
##   ..   loan_status = col_character(),
##   ..   initial_listing_status = col_character(),
##   ..   disbursement_method = col_character(),
##   ..   balance = col_double(),
##   ..   paid_total = col_double(),
##   ..   paid_principal = col_double(),
##   ..   paid_interest = col_double(),
##   ..   paid_late_fees = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(loans)
## # A tibble: 6 × 55
##   emp_title         emp_length state homeownership annual_income verified_income
##   <chr>                  <dbl> <chr> <chr>                 <dbl> <chr>          
## 1 global config en…          3 NJ    MORTGAGE              90000 Verified       
## 2 warehouse office…         10 HI    RENT                  40000 Not Verified   
## 3 assembly                   3 WI    RENT                  40000 Source Verified
## 4 customer service           1 PA    RENT                  30000 Not Verified   
## 5 security supervi…         10 CA    RENT                  35000 Verified       
## 6 <NA>                      NA KY    OWN                   34000 Not Verified   
## # ℹ 49 more variables: debt_to_income <dbl>, annual_income_joint <dbl>,
## #   verification_income_joint <chr>, debt_to_income_joint <dbl>,
## #   delinq_2y <dbl>, months_since_last_delinq <dbl>,
## #   earliest_credit_line <dbl>, inquiries_last_12m <dbl>,
## #   total_credit_lines <dbl>, open_credit_lines <dbl>,
## #   total_credit_limit <dbl>, total_credit_utilized <dbl>,
## #   num_collections_last_12m <dbl>, num_historical_failed_to_pay <dbl>, …
summary(loans)
##   emp_title           emp_length       state           homeownership     
##  Length:10000       Min.   : 0.00   Length:10000       Length:10000      
##  Class :character   1st Qu.: 2.00   Class :character   Class :character  
##  Mode  :character   Median : 6.00   Mode  :character   Mode  :character  
##                     Mean   : 5.93                                        
##                     3rd Qu.:10.00                                        
##                     Max.   :10.00                                        
##                     NA's   :817                                          
##  annual_income     verified_income    debt_to_income   annual_income_joint
##  Min.   :      0   Length:10000       Min.   :  0.00   Min.   :  19200    
##  1st Qu.:  45000   Class :character   1st Qu.: 11.06   1st Qu.:  86834    
##  Median :  65000   Mode  :character   Median : 17.57   Median : 113000    
##  Mean   :  79222                      Mean   : 19.31   Mean   : 127915    
##  3rd Qu.:  95000                      3rd Qu.: 25.00   3rd Qu.: 151546    
##  Max.   :2300000                      Max.   :469.09   Max.   :1100000    
##                                       NA's   :24       NA's   :8505       
##  verification_income_joint debt_to_income_joint   delinq_2y     
##  Length:10000              Min.   : 0.32        Min.   : 0.000  
##  Class :character          1st Qu.:14.16        1st Qu.: 0.000  
##  Mode  :character          Median :19.72        Median : 0.000  
##                            Mean   :19.98        Mean   : 0.216  
##                            3rd Qu.:25.50        3rd Qu.: 0.000  
##                            Max.   :39.98        Max.   :13.000  
##                            NA's   :8505                         
##  months_since_last_delinq earliest_credit_line inquiries_last_12m
##  Min.   :  1.00           Min.   :1963         Min.   : 0.000    
##  1st Qu.: 19.00           1st Qu.:1997         1st Qu.: 0.000    
##  Median : 34.00           Median :2003         Median : 1.000    
##  Mean   : 36.76           Mean   :2001         Mean   : 1.958    
##  3rd Qu.: 53.00           3rd Qu.:2006         3rd Qu.: 3.000    
##  Max.   :118.00           Max.   :2015         Max.   :29.000    
##  NA's   :5658                                                    
##  total_credit_lines open_credit_lines total_credit_limit total_credit_utilized
##  Min.   : 2.00      Min.   : 0.0      Min.   :      0    Min.   :     0       
##  1st Qu.:14.00      1st Qu.: 7.0      1st Qu.:  51594    1st Qu.: 19186       
##  Median :21.00      Median :10.0      Median : 114667    Median : 36927       
##  Mean   :22.68      Mean   :11.4      Mean   : 183606    Mean   : 51049       
##  3rd Qu.:29.00      3rd Qu.:14.0      3rd Qu.: 267550    3rd Qu.: 65421       
##  Max.   :87.00      Max.   :51.0      Max.   :3386034    Max.   :942456       
##                                                                               
##  num_collections_last_12m num_historical_failed_to_pay months_since_90d_late
##  Min.   :0.0000           Min.   : 0.0000              Min.   :  2.00       
##  1st Qu.:0.0000           1st Qu.: 0.0000              1st Qu.: 29.00       
##  Median :0.0000           Median : 0.0000              Median : 47.00       
##  Mean   :0.0138           Mean   : 0.1671              Mean   : 46.11       
##  3rd Qu.:0.0000           3rd Qu.: 0.0000              3rd Qu.: 63.00       
##  Max.   :3.0000           Max.   :52.0000              Max.   :128.00       
##                                                        NA's   :7715         
##  current_accounts_delinq total_collection_amount_ever
##  Min.   :0e+00           Min.   :     0.0            
##  1st Qu.:0e+00           1st Qu.:     0.0            
##  Median :0e+00           Median :     0.0            
##  Mean   :1e-04           Mean   :   184.3            
##  3rd Qu.:0e+00           3rd Qu.:     0.0            
##  Max.   :1e+00           Max.   :199308.0            
##                                                      
##  current_installment_accounts accounts_opened_24m
##  Min.   : 0.000               Min.   : 0.000     
##  1st Qu.: 1.000               1st Qu.: 2.000     
##  Median : 2.000               Median : 4.000     
##  Mean   : 2.664               Mean   : 4.376     
##  3rd Qu.: 3.000               3rd Qu.: 6.000     
##  Max.   :35.000               Max.   :29.000     
##                                                  
##  months_since_last_credit_inquiry num_satisfactory_accounts
##  Min.   : 0.000                   Min.   : 0.00            
##  1st Qu.: 2.000                   1st Qu.: 7.00            
##  Median : 6.000                   Median :10.00            
##  Mean   : 7.341                   Mean   :11.38            
##  3rd Qu.:11.000                   3rd Qu.:14.00            
##  Max.   :24.000                   Max.   :51.00            
##  NA's   :1271                                              
##  num_accounts_120d_past_due num_accounts_30d_past_due num_active_debit_accounts
##  Min.   :0                  Min.   :0e+00             Min.   : 0.000           
##  1st Qu.:0                  1st Qu.:0e+00             1st Qu.: 2.000           
##  Median :0                  Median :0e+00             Median : 3.000           
##  Mean   :0                  Mean   :1e-04             Mean   : 3.595           
##  3rd Qu.:0                  3rd Qu.:0e+00             3rd Qu.: 5.000           
##  Max.   :0                  Max.   :1e+00             Max.   :32.000           
##  NA's   :318                                                                   
##  total_debit_limit num_total_cc_accounts num_open_cc_accounts
##  Min.   :     0    Min.   : 2.00         Min.   : 0.000      
##  1st Qu.: 10000    1st Qu.: 7.00         1st Qu.: 5.000      
##  Median : 19500    Median :11.00         Median : 7.000      
##  Mean   : 27357    Mean   :13.03         Mean   : 8.095      
##  3rd Qu.: 36100    3rd Qu.:17.00         3rd Qu.:10.000      
##  Max.   :386700    Max.   :66.00         Max.   :46.000      
##                                                              
##  num_cc_carrying_balance num_mort_accounts account_never_delinq_percent
##  Min.   : 0.000          Min.   : 0.000    Min.   : 14.30              
##  1st Qu.: 3.000          1st Qu.: 0.000    1st Qu.: 92.60              
##  Median : 5.000          Median : 1.000    Median :100.00              
##  Mean   : 5.231          Mean   : 1.383    Mean   : 94.65              
##  3rd Qu.: 7.000          3rd Qu.: 2.000    3rd Qu.:100.00              
##  Max.   :43.000          Max.   :14.000    Max.   :100.00              
##                                                                        
##    tax_liens       public_record_bankrupt loan_purpose       application_type  
##  Min.   : 0.0000   Min.   :0.0000         Length:10000       Length:10000      
##  1st Qu.: 0.0000   1st Qu.:0.0000         Class :character   Class :character  
##  Median : 0.0000   Median :0.0000         Mode  :character   Mode  :character  
##  Mean   : 0.0433   Mean   :0.1238                                              
##  3rd Qu.: 0.0000   3rd Qu.:0.0000                                              
##  Max.   :52.0000   Max.   :3.0000                                              
##                                                                                
##   loan_amount         term       interest_rate    installment     
##  Min.   : 1000   Min.   :36.00   Min.   : 5.31   Min.   :  30.75  
##  1st Qu.: 8000   1st Qu.:36.00   1st Qu.: 9.43   1st Qu.: 256.04  
##  Median :14500   Median :36.00   Median :11.98   Median : 398.42  
##  Mean   :16362   Mean   :43.27   Mean   :12.43   Mean   : 476.21  
##  3rd Qu.:24000   3rd Qu.:60.00   3rd Qu.:15.05   3rd Qu.: 644.69  
##  Max.   :40000   Max.   :60.00   Max.   :30.94   Max.   :1566.59  
##                                                                   
##     grade            sub_grade         issue_month        loan_status       
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  initial_listing_status disbursement_method    balance        paid_total     
##  Length:10000           Length:10000        Min.   :    0   Min.   :    0.0  
##  Class :character       Class :character    1st Qu.: 6679   1st Qu.:  928.7  
##  Mode  :character       Mode  :character    Median :12379   Median : 1563.3  
##                                             Mean   :14459   Mean   : 2494.2  
##                                             3rd Qu.:20690   3rd Qu.: 2616.0  
##                                             Max.   :40000   Max.   :41630.4  
##                                                                              
##  paid_principal    paid_interest    paid_late_fees   
##  Min.   :    0.0   Min.   :   0.0   Min.   : 0.0000  
##  1st Qu.:  587.1   1st Qu.: 221.8   1st Qu.: 0.0000  
##  Median :  985.0   Median : 446.1   Median : 0.0000  
##  Mean   : 1894.4   Mean   : 599.7   Mean   : 0.1195  
##  3rd Qu.: 1694.6   3rd Qu.: 825.4   3rd Qu.: 0.0000  
##  Max.   :40000.0   Max.   :4216.4   Max.   :52.9800  
## 

3.Clean data

Analysis: For the analysis of my dataset I started by exploring the dataset as shown through the code above. After, I decided to clean my dataset starting with the NAs. For the variables I chose, a majority did not have any NAs. However, the variable debt_to_income did have NAs. There were 24 NAs, which is very small in comparison to the 10,000 observations this dataset had, so they won’t have much of an impact but I still fixed the NAs by replacing them with the median. After, I began preparing my data for my multiple linear regression. I selected the necessary variables that I would be using, I changed homeownership into a factor, and then had the loan amount arranged in ascending order.

#Check for any NAs for the variables I will be using
colSums(is.na(loans))
##                        emp_title                       emp_length 
##                              833                              817 
##                            state                    homeownership 
##                                0                                0 
##                    annual_income                  verified_income 
##                                0                                0 
##                   debt_to_income              annual_income_joint 
##                               24                             8505 
##        verification_income_joint             debt_to_income_joint 
##                             8545                             8505 
##                        delinq_2y         months_since_last_delinq 
##                                0                             5658 
##             earliest_credit_line               inquiries_last_12m 
##                                0                                0 
##               total_credit_lines                open_credit_lines 
##                                0                                0 
##               total_credit_limit            total_credit_utilized 
##                                0                                0 
##         num_collections_last_12m     num_historical_failed_to_pay 
##                                0                                0 
##            months_since_90d_late          current_accounts_delinq 
##                             7715                                0 
##     total_collection_amount_ever     current_installment_accounts 
##                                0                                0 
##              accounts_opened_24m months_since_last_credit_inquiry 
##                                0                             1271 
##        num_satisfactory_accounts       num_accounts_120d_past_due 
##                                0                              318 
##        num_accounts_30d_past_due        num_active_debit_accounts 
##                                0                                0 
##                total_debit_limit            num_total_cc_accounts 
##                                0                                0 
##             num_open_cc_accounts          num_cc_carrying_balance 
##                                0                                0 
##                num_mort_accounts     account_never_delinq_percent 
##                                0                                0 
##                        tax_liens           public_record_bankrupt 
##                                0                                0 
##                     loan_purpose                 application_type 
##                                0                                0 
##                      loan_amount                             term 
##                                0                                0 
##                    interest_rate                      installment 
##                                0                                0 
##                            grade                        sub_grade 
##                                0                                0 
##                      issue_month                      loan_status 
##                                0                                0 
##           initial_listing_status              disbursement_method 
##                                0                                0 
##                          balance                       paid_total 
##                                0                                0 
##                   paid_principal                    paid_interest 
##                                0                                0 
##                   paid_late_fees 
##                                0
#Fixing all NAs in the variables I will be using
median_dti <- median(loans$debt_to_income, na.rm = TRUE) 
loans$debt_to_income[is.na(loans$debt_to_income)] <- median_dti

Note: The variables loan_amount, homeownership, annual_income do not have any missing data. However, the debt_to_income variable has only 24 which is very small in comparison to the 10,000 observations. I opted to fix the NAs by utilizing the median.

4. Analyze the data

loans_for_model <- loans |>
  select(homeownership, annual_income, debt_to_income, loan_amount) |>
  mutate (homeownership = as.factor(homeownership)) |>
  arrange(loan_amount)

5. Regression Analysis - Multiple Linear Regression

model <- lm(loan_amount ~ homeownership + annual_income + debt_to_income, data = loans_for_model)

summary(model)
## 
## Call:
## lm(formula = loan_amount ~ homeownership + annual_income + debt_to_income, 
##     data = loans_for_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91406  -7132  -1954   5984  29121 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.190e+04  2.535e+02  46.958  < 2e-16 ***
## homeownershipOWN  -1.305e+03  2.979e+02  -4.379  1.2e-05 ***
## homeownershipRENT -2.318e+03  2.123e+02 -10.919  < 2e-16 ***
## annual_income      5.186e-02  1.539e-03  33.700  < 2e-16 ***
## debt_to_income     7.357e+01  6.561e+00  11.213  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9614 on 9995 degrees of freedom
## Multiple R-squared:  0.1294, Adjusted R-squared:  0.1291 
## F-statistic: 371.5 on 4 and 9995 DF,  p-value: < 2.2e-16

Coefficients: Note: There are three kinds of homeownerships: OWN, RENT, and MORTGAGE. The MORTGAGE one is used as the baseline and OWN and RENT are extra indicators.

homeownershipOWN: The coefficient for those who own homes is -1,305 which means those who own home typically get loans amounts that is $1,305 smaller than the baseline (MORTGAGE).

homeownershipRENT: The coefficient for those who rent homes is -2318 which means those who rent their home typically get loan amounts that are $2,318 smaller than the baseline.

annual_income: The coefficient for annual_income is 0.05186. For every unit increase in annual income, there is an increase of $0.05 increase in loan amount.

debt_to_income: The coefficient for debt_to_income is 73.57. For every unit increase in debt to income, the loan amount increases by $73.57

6. Model Assumptions and Diagnostics

1. Linearity Check

library(car)
## Warning: package 'car' was built under R version 4.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
crPlots(model)

Results: For the annual_income, there is a sharp upward trend followed by a downward trend, the linearity for this variable is not good. Therefore the predictions are most likely not reliable. For the debt_to_income, there is a steady upward trend so linearity is reasonable. For homeownership, it’s treated as a factor so linearity doesn’t apply the same way but it looks fine.

2. Independence

plot(resid(model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Results: For the most part, most of the data is centered around the 0 with a few outbursts around 4400 and 10000. However, there is an upward trend in this graph, indicating that the data may not be independent.

3. Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

par(mfrow=c(2,2)); plot(model); par(mfrow=c(1,1))

Residual vs Fitted: There is a clear downward trend and there is a funnel shape, meaning there is non-linearity.

Homoscedasticity (equal variance): As noted earlier by the residual vs fitted, there is a funnel shape which means there isn’t equal variance. There is heteroscedasticity.

Scale Location: There is a clear upward trend, meaning there is heteroscedasticity.

Q-Q: At the start there is left-tale deviation, but the rest is fine so there is slight non-normality.

Residuals vs Leverage: There are few high leverage points which can be influential.

4. Check Multicollinearity

cor(loans_for_model[, c("annual_income", "debt_to_income")], use = "complete.obs")
##                annual_income debt_to_income
## annual_income      1.0000000     -0.1797765
## debt_to_income    -0.1797765      1.0000000

Note: Since homeownership is a factor, I did not use it to check for multicollinearity

There is very low correlation between annual_income and debt_to_income at -0.18 so there isn’t high multicollinearity.

Conclusion

In conclusion, annual income was a significant positive predictor where people with higher incomes receive larger loans. The debt to income was also a significant positive predictor where people with larger debt to income receive larger loans. The home ownership status was also significant for this model where both renters and owners of homes get a smaller loan amount than those who’s home ownership status is mortgage holders. Based on my model these factors play a significant role as a predictor for loan amount.

This research is relevant as it can help people who intend to take loans out understand the factors that may play into the loan amount they may get.It can also be used to understand that borrowing capacity is a large factor in the amount people are able to take out.

There are quite a few limitations to this model because this model has poor abilities to make reliable predictions as it violates several requirements needed for a multiple linear regression model. My model does not have high multicollinearity, but there is non-normality, non-linearity, heteroscedasticity, and possible non-independence. The adjusted R² value is extremely low at only 0.1291, meaning this model can explain about 13% of the loan amount variation. Not to mention this model only looks at a few factors rather than the many other factors that may be key predictors.

For future improvements, the most important thing to focus on is building a more reliable model where there is normality, linearity, and proper independence. The current data I have is not helpful for developing reliable predictions therefore creating a new model would be more important.

References: https://www.openintro.org/data/index.php?data=loans_full_schema