library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
install.packages("qdap")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:dplyr':
##
## explain
## The following object is masked from 'package:ggplot2':
##
## %+%
## Loading required package: qdapTools
##
## Attaching package: 'qdapTools'
## The following object is masked from 'package:dplyr':
##
## id
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:base':
##
## Filter, proportions
This workbook is an in-class assignment. After completion you will have a good practical knowledge of Multiple Regression and Logistic Regression.
We will consider data about loans from the peer-to-peer lender, Lending Club. The data is in the loan.csv file in the project directory.
loan <- read.csv("loan.csv")
You may find the following variable dictionary useful.
| variable | description |
|---|---|
| interest_rate | Interest rate for the loan. |
| income_ver | Categorical variable describing whether the borrower’s income source and amount have been verified, with levels verified, source_only, not. |
| debt_to_income | Debt-to-income ratio, which is the percent of total debt of the borrower divided by their total income. |
| credit_util | Of all the credit available to the borrower, what fraction are they using. For example, the credit utilization on a credit card would be the card’s balance divided by the card’s credit limit. |
| bankruptcy | An indicator variable for whether the borrower has a past bankruptcy in her record. This variable takes a value of 1 if the answer is “yes” and 0 if the answer is “no”. |
| term | The length of the loan, in months. |
| issued | The month and year the loan was issued. |
| credit_checks | Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check. |
Recall the single variable models we have been studying. The prediction looks like this: \[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\] where the quantities with a \(\hat{\,}\) over them are the estimates obtained by lm.
In every case we have, more or less implicitly, assumed that \(X\) is a numeric variable. Would this make any sense if \(X\) is a categorical variable?
For income_ver, what are the possible values of income_ver and what is the proportion of all records in each category of income_ver? Use code to find this; do not use the above dictionary, as it might be wrong or out of date.
income_ver_total <- nrow(loan)
income_ver_total
## [1] 10000
income_ver_summary <- loan %>%
group_by(income_ver) %>%
summarise(n = n())
income_ver_summary
income_ver_summary <- income_ver_summary %>%
mutate(proportion = percent(n/income_ver_total))
income_ver_summary
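As a hedged alternative to the steps above, the counts and proportions can be produced in one pipeline with `dplyr::count()`. The data frame `toy_loan` below is a made-up stand-in for `loan` (its values are assumptions for illustration only), and the proportion is kept numeric rather than formatted with `percent()`, which returns character strings.

```r
library(dplyr)

# Hypothetical toy data standing in for the loan data frame.
toy_loan <- data.frame(
  income_ver = c("not", "not", "source_only", "verified", "not")
)

# count() tallies each level; dividing by the total gives a numeric proportion.
toy_summary <- toy_loan %>%
  count(income_ver) %>%
  mutate(proportion = n / sum(n))

toy_summary
```

Keeping `proportion` numeric makes it easy to check that the proportions sum to 1 before formatting them for display.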
Suppose we are interested in how interest_rate varies as a function of income_ver. What does lm mean for problems like this?
To understand how to handle categorical variables, we will start with a special type of categorical variable called an indicator variable. We have such a variable in the loan data set: it is called bankruptcy, and it takes the value 0 if the applicant has had no previous bankruptcies and 1 if the applicant has had at least one previous bankruptcy.
group_by(loan, bankruptcy) %>%
summarise(proportion = percent(n()/nrow(loan)))
What are the types of income_ver and bankruptcy? # bankruptcy is an integer and income_ver is a character.
Given a value \(x\) of bankruptcy, does it make sense to multiply \(x\) by a number, e.g., \(2.3*x\)? Hint: consider the possible values of \(x\). # Since bankruptcy is either 0 or 1, the product is always either zero or that number, so multiplying by \(x\) acts as a switch that turns the number on or off.
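As a worked illustration of the hint, substituting the two possible values of \(x\) into the single variable model \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\) leaves only two possible predictions:

\[ \hat{Y} = \begin{cases} \hat{\beta_0} & \text{if } x = 0 \\ \hat{\beta_0} + \hat{\beta_1} & \text{if } x = 1 \end{cases} \]

Read this way, \(\hat{\beta_0}\) is the prediction for the group with \(x = 0\), and \(\hat{\beta_1}\) is the amount added for the group with \(x = 1\).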
Summarize the results of regressing interest_rate on bankruptcy using lm
regression1 <- lm(interest_rate ~ bankruptcy, loan)
summary(regression1)
##
## Call:
## lm(formula = interest_rate ~ bankruptcy, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7648 -3.6448 -0.4548 2.7120 18.6020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3380 0.0533 231.490 < 2e-16 ***
## bankruptcy 0.7368 0.1529 4.819 1.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.996 on 9998 degrees of freedom
## Multiple R-squared: 0.002317, Adjusted R-squared: 0.002217
## F-statistic: 23.22 on 1 and 9998 DF, p-value: 1.467e-06
Interpret the results of this regression. # There is a significant relationship between bankruptcy and interest_rate given the low p-value, but R-squared is also very low (about 0.002), so bankruptcy alone explains very little of the variation in interest_rate.
group_by(loan, bankruptcy) %>%
summarise(avg_interest_rate = mean(interest_rate)) %>%
filter(bankruptcy == 0)
How can you explain this value? # The intercept is the predicted value of y when x equals 0, so here it is the mean interest rate of borrowers with no bankruptcy, which is the Intercept estimate of 12.3380.
What is the average interest rate for people who have had at least one bankruptcy?
group_by(loan, bankruptcy) %>%
summarise(avg_interest_rate = mean(interest_rate)) %>%
filter(bankruptcy == 1)
Could you have determined this value from the regression summary? If so, how? # Yes. 12.3380 + 1 * 0.7368 = 13.0748, which matches the value 13.07479 up to rounding.
How do you interpret the meaning of slope in this context? # Since x is either 0 or 1, the slope is the difference between the two group averages: borrowers with at least one bankruptcy have an interest rate that is 0.7368 higher on average.
Which is more important, the estimated slope or \(R^2\)? # The slope gives the direction and size of the relationship, and its p-value tells us whether there is a relationship at all. R-squared measures the fit of the model. R-squared is hard to interpret without understanding the effect of the slope.
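The link between the regression on an indicator variable and the two group averages can be checked on simulated data. This is a minimal sketch: the data frame `toy` and its values are made up, and only the column names mirror the workbook.

```r
# Simulated data; the column names mirror the workbook but the values are made up.
set.seed(1)
toy <- data.frame(bankruptcy = rep(c(0, 1), times = c(80, 20)))
toy$interest_rate <- 12 + 0.7 * toy$bankruptcy + rnorm(nrow(toy), sd = 5)

fit <- lm(interest_rate ~ bankruptcy, data = toy)

# Group means computed directly, without any regression.
mean_0 <- mean(toy$interest_rate[toy$bankruptcy == 0])
mean_1 <- mean(toy$interest_rate[toy$bankruptcy == 1])

# The intercept is the mean of the 0 group; the slope is the gap between the groups.
# Both differences below should be zero up to floating-point error.
unname(coef(fit)["(Intercept)"]) - mean_0
unname(coef(fit)["bankruptcy"]) - (mean_1 - mean_0)
```

With a single 0/1 predictor, least squares has exactly two fitted values, so it can do no better than predict each group's own mean.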
Regress interest_rate on income_ver.
regression2 <- lm(interest_rate ~ income_ver, loan)
summary(regression2)
##
## Call:
## lm(formula = interest_rate ~ income_ver, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0437 -3.7495 -0.6795 2.5345 19.6905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.09946 0.08091 137.18 <2e-16 ***
## income_versource_only 1.41602 0.11074 12.79 <2e-16 ***
## income_ververified 3.25429 0.12970 25.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.851 on 9997 degrees of freedom
## Multiple R-squared: 0.05945, Adjusted R-squared: 0.05926
## F-statistic: 315.9 on 2 and 9997 DF, p-value: < 2.2e-16
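Response Question How many variables does the regression say we have? # Three coefficients: the intercept plus two indicator variables, income_versource_only and income_ververified.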
Coding Question Calculate mean(interest_rate) for each level of the categorical variable income_ver
group_by(loan, income_ver) %>%
summarise(avg_int_rate = mean(interest_rate))
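How do your values correspond to the coefficient values in the regression? # The intercept (11.09946) is the mean interest rate for the baseline level, the one without its own coefficient. Each other coefficient is the difference between that level's mean and the baseline mean, so the means are about 11.09946 + 1.41602 = 12.51548 for source_only and 11.09946 + 3.25429 = 14.35375 for verified.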
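The correspondence between coefficients and group means can be verified on simulated data. In this sketch the data frame `toy`, the vector `base_rate`, and all numeric values are assumptions chosen for illustration; only the level names are borrowed from the workbook.

```r
# Simulated three-level data; values are made up for illustration.
set.seed(2)
toy <- data.frame(
  income_ver = rep(c("not", "source_only", "verified"), times = c(50, 30, 20))
)
base_rate <- c(not = 11, source_only = 12.5, verified = 14)
toy$interest_rate <- unname(base_rate[toy$income_ver]) + rnorm(nrow(toy), sd = 4)

fit <- lm(interest_rate ~ income_ver, data = toy)

# Group means computed directly from the data.
group_means <- sapply(split(toy$interest_rate, toy$income_ver), mean)

# Intercept alone is the baseline mean; intercept plus a coefficient is that level's mean.
b <- coef(fit)
from_model <- c(
  not         = unname(b["(Intercept)"]),
  source_only = unname(b["(Intercept)"] + b["income_versource_only"]),
  verified    = unname(b["(Intercept)"] + b["income_ververified"])
)

rbind(group_means, from_model)
```

The two rows of the printed table should agree, which is the same relationship seen between the workbook's group means and the regression summary.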
Given what you know now, why do you think you have two new variables related to income_ver? # lm replaces a categorical variable with one indicator variable for each level other than the baseline. income_ver has three levels, so it yields two indicators: income_versource_only and income_ververified.
Do you get an extra variable in the case of single variable regression? If so, what is it? # Single variable regression has one dependent variable and one independent variable, so the only extra term in the output is the intercept.
How would you write a regression model for the regression of interest_rate on income_ver? # \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2\) # Y = interest_rate # \(X_1\) = 1 if income_ver is source_only, otherwise 0 # \(X_2\) = 1 if income_ver is verified, otherwise 0
Multiple regression means that we are regressing on a sum of variables. In fact when we regress on a categorical variable we are doing multiple regression! Why? # Each level other than the baseline becomes its own indicator variable, so a categorical variable with three or more levels puts a sum of two or more variables into the model.
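To see the indicator variables that lm builds behind the scenes, one option is to inspect the design matrix with `model.matrix()`. The small data frame `toy` below is hypothetical; the level names are borrowed from the regression output above.

```r
# Hypothetical three-level factor, mirroring income_ver.
toy <- data.frame(
  income_ver = factor(c("not", "source_only", "verified", "not"))
)

# model.matrix() shows the design matrix lm() would build from this formula.
design <- model.matrix(~ income_ver, data = toy)

colnames(design)
design
```

Each row has a 1 in the intercept column, and at most one of the two indicator columns is 1. A row for the baseline level has 0 in both, which is why the baseline needs no coefficient of its own.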
Since the loan data set gives us a lot of variables let’s try regression on all of them. (run the code below)
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan))
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
## bankruptcy + term + issued + credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9070 -3.4362 -0.7239 2.5397 18.0874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.969e+00 2.087e-01 19.016 < 2e-16 ***
## income_versource_only 1.083e+00 1.036e-01 10.456 < 2e-16 ***
## income_ververified 2.482e+00 1.223e-01 20.293 < 2e-16 ***
## debt_to_income 3.787e-02 3.121e-03 12.137 < 2e-16 ***
## credit_util -4.323e-06 8.733e-07 -4.950 7.54e-07 ***
## bankruptcy 5.043e-01 1.383e-01 3.645 0.000269 ***
## term 1.495e-01 4.123e-03 36.257 < 2e-16 ***
## issuedJan2018 -2.061e-02 1.128e-01 -0.183 0.854969
## issuedMar2018 -9.280e-02 1.112e-01 -0.834 0.404037
## credit_checks 2.233e-01 1.917e-02 11.647 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.489 on 9966 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1942, Adjusted R-squared: 0.1934
## F-statistic: 266.8 on 9 and 9966 DF, p-value: < 2.2e-16
Comment on the \(p\)-values for the variables, noting anything interesting. # All of the variables are significant except ‘issuedJan2018’ and ‘issuedMar2018’. And although bankruptcy is significant, its p-value is much higher than those of the other significant variables.
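The summary above also reports that 24 observations were deleted due to missingness. A small sketch of how that count arises, using a made-up data frame `toy` (not the loan data): by default lm drops any row with a missing value in a variable used by the formula.

```r
# Made-up data with two missing predictor values.
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x = c(1, 2, NA, 4, 5, NA)
)

# Number of missing values in each column.
colSums(is.na(toy))

# With the default na.action (na.omit), only complete rows are fitted.
fit <- lm(y ~ x, data = toy)

nobs(fit)               # rows actually used in the fit
nrow(toy) - nobs(fit)   # rows dropped because of missing values
```

Running `colSums(is.na(loan))` on the real data would be one way to find out which columns are responsible for the dropped rows.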
Comment on the difference between \(R^2\) and adj-\(R^2\). Which do you think is more variable? Do you think the value of either \(R^2\) is too low to make this model useful? # Adjusted R-squared is a better measure because it penalizes variables that do not improve the fit. For instance, if I rerun the lm regression but remove ‘issued’, R-squared goes down (0.1942 to 0.1941), but adjusted R-squared goes up (0.1934 to 0.1935).
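For reference, the usual definition of adjusted R-squared is \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(n\) is the number of observations used and \(p\) the number of predictors excluding the intercept. The sketch below recomputes it by hand on simulated data and compares it with the value reported by `summary()`. The data frame `toy` and its columns are made up for illustration.

```r
# Simulated data with one pure-noise predictor.
set.seed(3)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 3)

fit <- lm(y ~ x1 + x2 + noise, data = toy)
s <- summary(fit)

# Number of predictors, excluding the intercept.
p <- length(coef(fit)) - 1

# Adjusted R-squared from its definition.
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

c(r_squared = s$r.squared, adjusted = s$adj.r.squared, by_hand = adj_by_hand)
```

Because the correction factor grows with \(p\), adding a predictor that barely raises R-squared can lower the adjusted value, which is the pattern seen when ‘issued’ is included.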
regression3 <- lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
bankruptcy + term + credit_checks, data = loan)
summary(regression3)
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
## bankruptcy + term + credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8988 -3.4316 -0.7204 2.5421 18.1083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.928e+00 1.960e-01 20.042 < 2e-16 ***
## income_versource_only 1.080e+00 1.035e-01 10.436 < 2e-16 ***
## income_ververified 2.479e+00 1.223e-01 20.278 < 2e-16 ***
## debt_to_income 3.790e-02 3.120e-03 12.148 < 2e-16 ***
## credit_util -4.319e-06 8.733e-07 -4.946 7.68e-07 ***
## bankruptcy 5.055e-01 1.383e-01 3.654 0.00026 ***
## term 1.495e-01 4.122e-03 36.272 < 2e-16 ***
## credit_checks 2.233e-01 1.917e-02 11.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 9968 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1941, Adjusted R-squared: 0.1935
## F-statistic: 342.9 on 7 and 9968 DF, p-value: < 2.2e-16
The multi-variate model we have just constructed is not necessarily the best model.
Why might this be the case? # The ‘issued’ variable is not significant and could be making the model less precise.
How might we try to improve things? # Remove ‘issued’ and maybe ‘bankruptcy’.
Let’s look at the coefficient estimates for the full model (the one with all the variables)
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.969028e+00 2.087232e-01 19.0157518 3.163114e-79
## income_versource_only 1.082767e+00 1.035565e-01 10.4558002 1.866232e-25
## income_ververified 2.482072e+00 1.223104e-01 20.2932296 9.462920e-90
## debt_to_income 3.787448e-02 3.120564e-03 12.1370626 1.160422e-33
## credit_util -4.323176e-06 8.733479e-07 -4.9501194 7.538276e-07
## bankruptcy 5.042764e-01 1.383482e-01 3.6449793 2.687739e-04
## term 1.495000e-01 4.123314e-03 36.2572457 1.696856e-270
## issuedJan2018 -2.060967e-02 1.127529e-01 -0.1827863 8.549694e-01
## issuedMar2018 -9.279620e-02 1.112041e-01 -0.8344677 4.040375e-01
## credit_checks 2.232706e-01 1.916941e-02 11.6472314 3.770754e-31
Response Question Look at the \(p\)-values. Which are the largest and which are the smallest? Are any of the variable estimates not significant at the 95% level? # The largest p-values belong to the two ‘issued’ coefficients, followed by ‘bankruptcy’ and ‘credit_util’ respectively. The smallest belong to ‘term’, ‘income_ververified’, and ‘debt_to_income’. Only the two ‘issued’ coefficients are not significant at the 95% level.
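Rather than reading the p-values off the table by eye, they can be pulled out of the coefficient matrix and sorted. This is a sketch on simulated data; `toy`, `x1`, and `noise` are hypothetical names, and the 0.05 cutoff corresponds to the 95% level mentioned above.

```r
# Simulated data: x1 has a real effect, noise does not.
set.seed(5)
n <- 200
toy <- data.frame(x1 = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + noise, data = toy)

# coef(summary(fit)) is a numeric matrix; pull the p-value column by name.
p_values <- coef(summary(fit))[, "Pr(>|t|)"]

# Terms ordered from largest to smallest p-value.
sort(p_values, decreasing = TRUE)

# Names of terms that are not significant at the 5% level (may be empty).
names(p_values)[p_values > 0.05]
```

The same two lines applied to the full loan model would list the terms to consider dropping.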
Coding Question Rerun the regression leaving out the non-significant variables.
## only deleting 'issued' from the new model.
regression4 <- lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
bankruptcy + term + credit_checks, data = loan)
summary(regression4)
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
## bankruptcy + term + credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8988 -3.4316 -0.7204 2.5421 18.1083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.928e+00 1.960e-01 20.042 < 2e-16 ***
## income_versource_only 1.080e+00 1.035e-01 10.436 < 2e-16 ***
## income_ververified 2.479e+00 1.223e-01 20.278 < 2e-16 ***
## debt_to_income 3.790e-02 3.120e-03 12.148 < 2e-16 ***
## credit_util -4.319e-06 8.733e-07 -4.946 7.68e-07 ***
## bankruptcy 5.055e-01 1.383e-01 3.654 0.00026 ***
## term 1.495e-01 4.122e-03 36.272 < 2e-16 ***
## credit_checks 2.233e-01 1.917e-02 11.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 9968 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1941, Adjusted R-squared: 0.1935
## F-statistic: 342.9 on 7 and 9968 DF, p-value: < 2.2e-16
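Compare this more parsimonious model with the full model. What are your observations? # R-squared is slightly lower (0.1941 versus 0.1942) but adjusted R-squared is slightly higher (0.1935 versus 0.1934), and the remaining coefficient estimates barely change.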
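One way to make this comparison more formal is a partial F-test between the two nested models via `anova()`. The workbook itself does not use this technique, so it is shown here only as an optional sketch on simulated data (`toy`, `full`, and `reduced` are made-up names). Both models must be fitted to the same rows for the comparison to be valid.

```r
# Simulated data with one pure-noise predictor.
set.seed(4)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 2)

full    <- lm(y ~ x1 + x2 + noise, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)

# R-squared can only go down (or stay equal) when a predictor is dropped;
# adjusted R-squared can move either way.
rbind(
  full    = c(r_squared = summary(full)$r.squared,
              adjusted  = summary(full)$adj.r.squared),
  reduced = c(r_squared = summary(reduced)$r.squared,
              adjusted  = summary(reduced)$adj.r.squared)
)

# Partial F-test: does the dropped predictor explain anything beyond the reduced model?
comparison <- anova(reduced, full)
comparison
```

A large p-value in the second row of `comparison` suggests the dropped predictor adds little, which supports keeping the smaller model.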
If you were to delete a variable from this model, which one would you delete? # I would not delete anything from the new model.
summary(regression4)
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
## bankruptcy + term + credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8988 -3.4316 -0.7204 2.5421 18.1083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.928e+00 1.960e-01 20.042 < 2e-16 ***
## income_versource_only 1.080e+00 1.035e-01 10.436 < 2e-16 ***
## income_ververified 2.479e+00 1.223e-01 20.278 < 2e-16 ***
## debt_to_income 3.790e-02 3.120e-03 12.148 < 2e-16 ***
## credit_util -4.319e-06 8.733e-07 -4.946 7.68e-07 ***
## bankruptcy 5.055e-01 1.383e-01 3.654 0.00026 ***
## term 1.495e-01 4.122e-03 36.272 < 2e-16 ***
## credit_checks 2.233e-01 1.917e-02 11.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 9968 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1941, Adjusted R-squared: 0.1935
## F-statistic: 342.9 on 7 and 9968 DF, p-value: < 2.2e-16
What are your observations? # All variables are significant, and the adjusted R-squared (0.1935) is slightly higher than in the full model (0.1934). Even so, the model explains only about 19% of the variation in interest_rate.