Research Question: How can debt to income, home ownership, and annual income predict loan amount?
The dataset I selected “represents thousands of loans made through the Lending Club platform.” This platform allows individuals to lend money to other individuals. This dataset contains a variety of information on the different individuals from home ownership to annual income to debt to income. There is also information about the loan amount they received and the status of their loans as well as a variety of information that relates to their loans. At 55 variables and 10,000 observations, this dataset covers a wide range of information for thousands of individuals.
The variables I will be focusing on is loan_amount, homeownership, annual_income, and debt_to_income. Loan_amount, annual_income, and debt_to_income are all quantitative (or numeric) variables and homeownership is a categorical variable. I will be using homeownership, annual_income, and debt_to_income to predict information about the loan_amount through multiple linear regression because home ownership, annual income, and debt to income may be key factors in determining the loan amount a person receives. Multiple linear regression is more helpful in this situation as I am largely using non-binary variables and my outcome is continuous.
The link to the source: https://www.openintro.org/data/index.php?data=loans_full_schema. This dataset was retrieved from the site OpenIntro.org Data Sets.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
loans <- read_csv("C:/DATA101/loans_full_schema.csv")
## Rows: 10000 Columns: 55
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): emp_title, state, homeownership, verified_income, verification_inc...
## dbl (42): emp_length, annual_income, debt_to_income, annual_income_joint, de...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(loans)
## spc_tbl_ [10,000 × 55] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ emp_title : chr [1:10000] "global config engineer" "warehouse office clerk" "assembly" "customer service" ...
## $ emp_length : num [1:10000] 3 10 3 1 10 NA 10 10 10 3 ...
## $ state : chr [1:10000] "NJ" "HI" "WI" "PA" ...
## $ homeownership : chr [1:10000] "MORTGAGE" "RENT" "RENT" "RENT" ...
## $ annual_income : num [1:10000] 90000 40000 40000 30000 35000 34000 35000 110000 65000 30000 ...
## $ verified_income : chr [1:10000] "Verified" "Not Verified" "Source Verified" "Not Verified" ...
## $ debt_to_income : num [1:10000] 18.01 5.04 21.15 10.16 57.96 ...
## $ annual_income_joint : num [1:10000] NA NA NA NA 57000 NA 155000 NA NA NA ...
## $ verification_income_joint : chr [1:10000] NA NA NA NA ...
## $ debt_to_income_joint : num [1:10000] NA NA NA NA 37.7 ...
## $ delinq_2y : num [1:10000] 0 0 0 0 0 1 0 1 1 0 ...
## $ months_since_last_delinq : num [1:10000] 38 NA 28 NA NA 3 NA 19 18 NA ...
## $ earliest_credit_line : num [1:10000] 2001 1996 2006 2007 2008 ...
## $ inquiries_last_12m : num [1:10000] 6 1 4 0 7 6 1 1 3 0 ...
## $ total_credit_lines : num [1:10000] 28 30 31 4 22 32 12 30 35 9 ...
## $ open_credit_lines : num [1:10000] 10 14 10 4 16 12 10 15 21 6 ...
## $ total_credit_limit : num [1:10000] 70795 28800 24193 25400 69839 ...
## $ total_credit_utilized : num [1:10000] 38767 4321 16000 4997 52722 ...
## $ num_collections_last_12m : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
## $ num_historical_failed_to_pay : num [1:10000] 0 1 0 1 0 0 0 0 0 0 ...
## $ months_since_90d_late : num [1:10000] 38 NA 28 NA NA 60 NA 71 18 NA ...
## $ current_accounts_delinq : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_collection_amount_ever : num [1:10000] 1250 0 432 0 0 0 0 0 0 0 ...
## $ current_installment_accounts : num [1:10000] 2 0 1 1 1 0 2 2 6 1 ...
## $ accounts_opened_24m : num [1:10000] 5 11 13 1 6 2 1 4 10 5 ...
## $ months_since_last_credit_inquiry: num [1:10000] 5 8 7 15 4 5 9 7 4 17 ...
## $ num_satisfactory_accounts : num [1:10000] 10 14 10 4 16 12 10 15 21 6 ...
## $ num_accounts_120d_past_due : num [1:10000] 0 0 0 0 0 0 0 NA 0 0 ...
## $ num_accounts_30d_past_due : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
## $ num_active_debit_accounts : num [1:10000] 2 3 3 2 10 1 3 5 11 3 ...
## $ total_debit_limit : num [1:10000] 11100 16500 4300 19400 32700 ...
## $ num_total_cc_accounts : num [1:10000] 14 24 14 3 20 27 8 16 19 7 ...
## $ num_open_cc_accounts : num [1:10000] 8 14 8 3 15 12 7 12 14 5 ...
## $ num_cc_carrying_balance : num [1:10000] 6 4 6 2 13 5 6 10 14 3 ...
## $ num_mort_accounts : num [1:10000] 1 0 0 0 0 3 2 7 2 0 ...
## $ account_never_delinq_percent : num [1:10000] 92.9 100 93.5 100 100 78.1 100 93 97.1 100 ...
## $ tax_liens : num [1:10000] 0 0 0 1 0 0 0 0 0 0 ...
## $ public_record_bankrupt : num [1:10000] 0 1 0 0 0 0 0 0 0 0 ...
## $ loan_purpose : chr [1:10000] "moving" "debt_consolidation" "other" "debt_consolidation" ...
## $ application_type : chr [1:10000] "individual" "individual" "individual" "individual" ...
## $ loan_amount : num [1:10000] 28000 5000 2000 21600 23000 5000 24000 20000 20000 6400 ...
## $ term : num [1:10000] 60 36 36 36 36 36 60 60 36 36 ...
## $ interest_rate : num [1:10000] 14.07 12.61 17.09 6.72 14.07 ...
## $ installment : num [1:10000] 652.5 167.5 71.4 664.2 786.9 ...
## $ grade : chr [1:10000] "C" "C" "D" "A" ...
## $ sub_grade : chr [1:10000] "C3" "C1" "D1" "A3" ...
## $ issue_month : chr [1:10000] "Mar-2018" "Feb-2018" "Feb-2018" "Jan-2018" ...
## $ loan_status : chr [1:10000] "Current" "Current" "Current" "Current" ...
## $ initial_listing_status : chr [1:10000] "whole" "whole" "fractional" "whole" ...
## $ disbursement_method : chr [1:10000] "Cash" "Cash" "Cash" "Cash" ...
## $ balance : num [1:10000] 27016 4651 1825 18853 21430 ...
## $ paid_total : num [1:10000] 1999 499 282 3313 2325 ...
## $ paid_principal : num [1:10000] 984 349 175 2747 1570 ...
## $ paid_interest : num [1:10000] 1015 150 106 566 755 ...
## $ paid_late_fees : num [1:10000] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. emp_title = col_character(),
## .. emp_length = col_double(),
## .. state = col_character(),
## .. homeownership = col_character(),
## .. annual_income = col_double(),
## .. verified_income = col_character(),
## .. debt_to_income = col_double(),
## .. annual_income_joint = col_double(),
## .. verification_income_joint = col_character(),
## .. debt_to_income_joint = col_double(),
## .. delinq_2y = col_double(),
## .. months_since_last_delinq = col_double(),
## .. earliest_credit_line = col_double(),
## .. inquiries_last_12m = col_double(),
## .. total_credit_lines = col_double(),
## .. open_credit_lines = col_double(),
## .. total_credit_limit = col_double(),
## .. total_credit_utilized = col_double(),
## .. num_collections_last_12m = col_double(),
## .. num_historical_failed_to_pay = col_double(),
## .. months_since_90d_late = col_double(),
## .. current_accounts_delinq = col_double(),
## .. total_collection_amount_ever = col_double(),
## .. current_installment_accounts = col_double(),
## .. accounts_opened_24m = col_double(),
## .. months_since_last_credit_inquiry = col_double(),
## .. num_satisfactory_accounts = col_double(),
## .. num_accounts_120d_past_due = col_double(),
## .. num_accounts_30d_past_due = col_double(),
## .. num_active_debit_accounts = col_double(),
## .. total_debit_limit = col_double(),
## .. num_total_cc_accounts = col_double(),
## .. num_open_cc_accounts = col_double(),
## .. num_cc_carrying_balance = col_double(),
## .. num_mort_accounts = col_double(),
## .. account_never_delinq_percent = col_double(),
## .. tax_liens = col_double(),
## .. public_record_bankrupt = col_double(),
## .. loan_purpose = col_character(),
## .. application_type = col_character(),
## .. loan_amount = col_double(),
## .. term = col_double(),
## .. interest_rate = col_double(),
## .. installment = col_double(),
## .. grade = col_character(),
## .. sub_grade = col_character(),
## .. issue_month = col_character(),
## .. loan_status = col_character(),
## .. initial_listing_status = col_character(),
## .. disbursement_method = col_character(),
## .. balance = col_double(),
## .. paid_total = col_double(),
## .. paid_principal = col_double(),
## .. paid_interest = col_double(),
## .. paid_late_fees = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(loans)
## # A tibble: 6 × 55
## emp_title emp_length state homeownership annual_income verified_income
## <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 global config en… 3 NJ MORTGAGE 90000 Verified
## 2 warehouse office… 10 HI RENT 40000 Not Verified
## 3 assembly 3 WI RENT 40000 Source Verified
## 4 customer service 1 PA RENT 30000 Not Verified
## 5 security supervi… 10 CA RENT 35000 Verified
## 6 <NA> NA KY OWN 34000 Not Verified
## # ℹ 49 more variables: debt_to_income <dbl>, annual_income_joint <dbl>,
## # verification_income_joint <chr>, debt_to_income_joint <dbl>,
## # delinq_2y <dbl>, months_since_last_delinq <dbl>,
## # earliest_credit_line <dbl>, inquiries_last_12m <dbl>,
## # total_credit_lines <dbl>, open_credit_lines <dbl>,
## # total_credit_limit <dbl>, total_credit_utilized <dbl>,
## # num_collections_last_12m <dbl>, num_historical_failed_to_pay <dbl>, …
summary(loans)
## emp_title emp_length state homeownership
## Length:10000 Min. : 0.00 Length:10000 Length:10000
## Class :character 1st Qu.: 2.00 Class :character Class :character
## Mode :character Median : 6.00 Mode :character Mode :character
## Mean : 5.93
## 3rd Qu.:10.00
## Max. :10.00
## NA's :817
## annual_income verified_income debt_to_income annual_income_joint
## Min. : 0 Length:10000 Min. : 0.00 Min. : 19200
## 1st Qu.: 45000 Class :character 1st Qu.: 11.06 1st Qu.: 86834
## Median : 65000 Mode :character Median : 17.57 Median : 113000
## Mean : 79222 Mean : 19.31 Mean : 127915
## 3rd Qu.: 95000 3rd Qu.: 25.00 3rd Qu.: 151546
## Max. :2300000 Max. :469.09 Max. :1100000
## NA's :24 NA's :8505
## verification_income_joint debt_to_income_joint delinq_2y
## Length:10000 Min. : 0.32 Min. : 0.000
## Class :character 1st Qu.:14.16 1st Qu.: 0.000
## Mode :character Median :19.72 Median : 0.000
## Mean :19.98 Mean : 0.216
## 3rd Qu.:25.50 3rd Qu.: 0.000
## Max. :39.98 Max. :13.000
## NA's :8505
## months_since_last_delinq earliest_credit_line inquiries_last_12m
## Min. : 1.00 Min. :1963 Min. : 0.000
## 1st Qu.: 19.00 1st Qu.:1997 1st Qu.: 0.000
## Median : 34.00 Median :2003 Median : 1.000
## Mean : 36.76 Mean :2001 Mean : 1.958
## 3rd Qu.: 53.00 3rd Qu.:2006 3rd Qu.: 3.000
## Max. :118.00 Max. :2015 Max. :29.000
## NA's :5658
## total_credit_lines open_credit_lines total_credit_limit total_credit_utilized
## Min. : 2.00 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.:14.00 1st Qu.: 7.0 1st Qu.: 51594 1st Qu.: 19186
## Median :21.00 Median :10.0 Median : 114667 Median : 36927
## Mean :22.68 Mean :11.4 Mean : 183606 Mean : 51049
## 3rd Qu.:29.00 3rd Qu.:14.0 3rd Qu.: 267550 3rd Qu.: 65421
## Max. :87.00 Max. :51.0 Max. :3386034 Max. :942456
##
## num_collections_last_12m num_historical_failed_to_pay months_since_90d_late
## Min. :0.0000 Min. : 0.0000 Min. : 2.00
## 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.: 29.00
## Median :0.0000 Median : 0.0000 Median : 47.00
## Mean :0.0138 Mean : 0.1671 Mean : 46.11
## 3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.: 63.00
## Max. :3.0000 Max. :52.0000 Max. :128.00
## NA's :7715
## current_accounts_delinq total_collection_amount_ever
## Min. :0e+00 Min. : 0.0
## 1st Qu.:0e+00 1st Qu.: 0.0
## Median :0e+00 Median : 0.0
## Mean :1e-04 Mean : 184.3
## 3rd Qu.:0e+00 3rd Qu.: 0.0
## Max. :1e+00 Max. :199308.0
##
## current_installment_accounts accounts_opened_24m
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 2.000 Median : 4.000
## Mean : 2.664 Mean : 4.376
## 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :35.000 Max. :29.000
##
## months_since_last_credit_inquiry num_satisfactory_accounts
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 7.00
## Median : 6.000 Median :10.00
## Mean : 7.341 Mean :11.38
## 3rd Qu.:11.000 3rd Qu.:14.00
## Max. :24.000 Max. :51.00
## NA's :1271
## num_accounts_120d_past_due num_accounts_30d_past_due num_active_debit_accounts
## Min. :0 Min. :0e+00 Min. : 0.000
## 1st Qu.:0 1st Qu.:0e+00 1st Qu.: 2.000
## Median :0 Median :0e+00 Median : 3.000
## Mean :0 Mean :1e-04 Mean : 3.595
## 3rd Qu.:0 3rd Qu.:0e+00 3rd Qu.: 5.000
## Max. :0 Max. :1e+00 Max. :32.000
## NA's :318
## total_debit_limit num_total_cc_accounts num_open_cc_accounts
## Min. : 0 Min. : 2.00 Min. : 0.000
## 1st Qu.: 10000 1st Qu.: 7.00 1st Qu.: 5.000
## Median : 19500 Median :11.00 Median : 7.000
## Mean : 27357 Mean :13.03 Mean : 8.095
## 3rd Qu.: 36100 3rd Qu.:17.00 3rd Qu.:10.000
## Max. :386700 Max. :66.00 Max. :46.000
##
## num_cc_carrying_balance num_mort_accounts account_never_delinq_percent
## Min. : 0.000 Min. : 0.000 Min. : 14.30
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 92.60
## Median : 5.000 Median : 1.000 Median :100.00
## Mean : 5.231 Mean : 1.383 Mean : 94.65
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.:100.00
## Max. :43.000 Max. :14.000 Max. :100.00
##
## tax_liens public_record_bankrupt loan_purpose application_type
## Min. : 0.0000 Min. :0.0000 Length:10000 Length:10000
## 1st Qu.: 0.0000 1st Qu.:0.0000 Class :character Class :character
## Median : 0.0000 Median :0.0000 Mode :character Mode :character
## Mean : 0.0433 Mean :0.1238
## 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :52.0000 Max. :3.0000
##
## loan_amount term interest_rate installment
## Min. : 1000 Min. :36.00 Min. : 5.31 Min. : 30.75
## 1st Qu.: 8000 1st Qu.:36.00 1st Qu.: 9.43 1st Qu.: 256.04
## Median :14500 Median :36.00 Median :11.98 Median : 398.42
## Mean :16362 Mean :43.27 Mean :12.43 Mean : 476.21
## 3rd Qu.:24000 3rd Qu.:60.00 3rd Qu.:15.05 3rd Qu.: 644.69
## Max. :40000 Max. :60.00 Max. :30.94 Max. :1566.59
##
## grade sub_grade issue_month loan_status
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## initial_listing_status disbursement_method balance paid_total
## Length:10000 Length:10000 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 6679 1st Qu.: 928.7
## Mode :character Mode :character Median :12379 Median : 1563.3
## Mean :14459 Mean : 2494.2
## 3rd Qu.:20690 3rd Qu.: 2616.0
## Max. :40000 Max. :41630.4
##
## paid_principal paid_interest paid_late_fees
## Min. : 0.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 587.1 1st Qu.: 221.8 1st Qu.: 0.0000
## Median : 985.0 Median : 446.1 Median : 0.0000
## Mean : 1894.4 Mean : 599.7 Mean : 0.1195
## 3rd Qu.: 1694.6 3rd Qu.: 825.4 3rd Qu.: 0.0000
## Max. :40000.0 Max. :4216.4 Max. :52.9800
##
Analysis: For the analysis of my dataset I started by exploring the dataset as shown through the code above. After, I decided to clean my dataset starting with the NAs. For the variables I chose, a majority did not have any NAs. However, the variable debt_to_income did have NAs. There were 24 NAs, which is very small in comparison to the 10,000 observations this dataset had, so they won’t have much of an impact but I still fixed the NAs by replacing them with the median. After, I began preparing my data for my multiple linear regression. I selected the necessary variables that I would be using, I changed homeownership into a factor, and then had the loan amount arranged in ascending order.
#Check for any NAs for the variables I will be using
colSums(is.na(loans))
## emp_title emp_length
## 833 817
## state homeownership
## 0 0
## annual_income verified_income
## 0 0
## debt_to_income annual_income_joint
## 24 8505
## verification_income_joint debt_to_income_joint
## 8545 8505
## delinq_2y months_since_last_delinq
## 0 5658
## earliest_credit_line inquiries_last_12m
## 0 0
## total_credit_lines open_credit_lines
## 0 0
## total_credit_limit total_credit_utilized
## 0 0
## num_collections_last_12m num_historical_failed_to_pay
## 0 0
## months_since_90d_late current_accounts_delinq
## 7715 0
## total_collection_amount_ever current_installment_accounts
## 0 0
## accounts_opened_24m months_since_last_credit_inquiry
## 0 1271
## num_satisfactory_accounts num_accounts_120d_past_due
## 0 318
## num_accounts_30d_past_due num_active_debit_accounts
## 0 0
## total_debit_limit num_total_cc_accounts
## 0 0
## num_open_cc_accounts num_cc_carrying_balance
## 0 0
## num_mort_accounts account_never_delinq_percent
## 0 0
## tax_liens public_record_bankrupt
## 0 0
## loan_purpose application_type
## 0 0
## loan_amount term
## 0 0
## interest_rate installment
## 0 0
## grade sub_grade
## 0 0
## issue_month loan_status
## 0 0
## initial_listing_status disbursement_method
## 0 0
## balance paid_total
## 0 0
## paid_principal paid_interest
## 0 0
## paid_late_fees
## 0
#Fixing all NAs in the variables I will be using
median_dti <- median(loans$debt_to_income, na.rm = TRUE)
loans$debt_to_income[is.na(loans$debt_to_income)] <- median_dti
Note: The variables loan_amount, homeownership, annual_income do not have any missing data. However, the debt_to_income variable has only 24 which is very small in comparison to the 10,000 observations. I opted to fix the NAs by utilizing the median.
loans_for_model <- loans |>
select(homeownership, annual_income, debt_to_income, loan_amount) |>
mutate (homeownership = as.factor(homeownership)) |>
arrange(loan_amount)
model <- lm(loan_amount ~ homeownership + annual_income + debt_to_income, data = loans_for_model)
summary(model)
##
## Call:
## lm(formula = loan_amount ~ homeownership + annual_income + debt_to_income,
## data = loans_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91406 -7132 -1954 5984 29121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.190e+04 2.535e+02 46.958 < 2e-16 ***
## homeownershipOWN -1.305e+03 2.979e+02 -4.379 1.2e-05 ***
## homeownershipRENT -2.318e+03 2.123e+02 -10.919 < 2e-16 ***
## annual_income 5.186e-02 1.539e-03 33.700 < 2e-16 ***
## debt_to_income 7.357e+01 6.561e+00 11.213 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9614 on 9995 degrees of freedom
## Multiple R-squared: 0.1294, Adjusted R-squared: 0.1291
## F-statistic: 371.5 on 4 and 9995 DF, p-value: < 2.2e-16
Coefficients: Note: There are three kinds of homeownerships: OWN, RENT, and MORTGAGE. The MORTGAGE one is used as the baseline and OWN and RENT are extra indicators.
homeownershipOWN: The coefficient for those who own homes is -1,305 which means those who own home typically get loans amounts that is $1,305 smaller than the baseline (MORTGAGE).
homeownershipRENT: The coefficient for those who rent homes is -2318 which means those who rent their home typically get loan amounts that are $2,318 smaller than the baseline.
annual_income: The coefficient for annual_income is 0.05186. For every unit increase in annual income, there is an increase of $0.05 increase in loan amount.
debt_to_income: The coefficient for debt_to_income is 73.57. For every unit increase in debt to income, the loan amount increases by $73.57
1. Linearity Check
library(car)
## Warning: package 'car' was built under R version 4.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
crPlots(model)
Results: For the annual_income, there is a sharp upward trend followed by a downward trend, the linearity for this variable is not good. Therefore the predictions are most likely not reliable. For the debt_to_income, there is a steady upward trend so linearity is reasonable. For homeownership, it’s treated as a factor so linearity doesn’t apply the same way but it looks fine.
2. Independence
plot(resid(model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
Results: For the most part, most of the data is centered around the 0 with a few outbursts around 4400 and 10000. However, there is an upward trend in this graph, indicating that the data may not be independent.
3. Core diagnostics (covers: linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(model); par(mfrow=c(1,1))
Residual vs Fitted: There is a clear downward trend and there is a funnel shape, meaning there is non-linearity.
Homoscedasticity (equal variance): As noted earlier by the residual vs fitted, there is a funnel shape which means there isn’t equal variance. There is heteroscedasticity.
Scale Location: There is a clear upward trend, meaning there is heteroscedasticity.
Q-Q: At the start there is left-tale deviation, but the rest is fine so there is slight non-normality.
Residuals vs Leverage: There are few high leverage points which can be influential.
4. Check Multicollinearity
cor(loans_for_model[, c("annual_income", "debt_to_income")], use = "complete.obs")
## annual_income debt_to_income
## annual_income 1.0000000 -0.1797765
## debt_to_income -0.1797765 1.0000000
Note: Since homeownership is a factor, I did not use it to check for multicollinearity
There is very low correlation between annual_income and debt_to_income at -0.18 so there isn’t high multicollinearity.
In conclusion, annual income was a significant positive predictor where people with higher incomes receive larger loans. The debt to income was also a significant positive predictor where people with larger debt to income receive larger loans. The home ownership status was also significant for this model where both renters and owners of homes get a smaller loan amount than those who’s home ownership status is mortgage holders. Based on my model these factors play a significant role as a predictor for loan amount.
This research is relevant as it can help people who intend to take loans out understand the factors that may play into the loan amount they may get.It can also be used to understand that borrowing capacity is a large factor in the amount people are able to take out.
There are quite a few limitations to this model because this model has poor abilities to make reliable predictions as it violates several requirements needed for a multiple linear regression model. My model does not have high multicollinearity, but there is non-normality, non-linearity, heteroscedasticity, and possible non-independence. The adjusted R² value is extremely low at only 0.1291, meaning this model can explain about 13% of the loan amount variation. Not to mention this model only looks at a few factors rather than the many other factors that may be key predictors.
For future improvements, the most important thing to focus on is building a more reliable model where there is normality, linearity, and proper independence. The current data I have is not helpful for developing reliable predictions therefore creating a new model would be more important.
References: https://www.openintro.org/data/index.php?data=loans_full_schema