library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5.9000 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
This workbook is an in class assignment. After completion you will have a good practical knowledge of Multiple Regression and Logistic Regression.
There are two types of exercises in this Tutorial.
Response questions These questions are thought exercises and require a written response.
Response Question Example: B
What is the Capital of North Dakota?
Response: Bismarck
Coding questions These questions require entering R code in the provided chunck.
Coding question example
Write the R code to calculate the average of the whole numbers from 1 to 10, enter your code here:
mean(1:10)
## [1] 5.5
We will consider data about loans from the peer-to-peer lender, Lending Club. The data is contained in the loan.csv file contained in the project directory.
loanloan <- read.csv("loan.csv")
You may find the following variable dictionary useful.
loan.| variable | description |
|---|---|
interest_rate |
Interest rate for the loan |
income_ver |
Categorical variable describing whether the borrower’s income source and amount have been verified, levels: verified,source_only, not |
dept_to_income |
Debt-to-income ratio, which is the percent of total debt of the borrower divided by their total income. |
credit_util |
Of all the Credit available to the borrower, what fraction are they using. For example the credit utilization on the credit card would be the card’s balance divided by the card’s credit limit |
bankruptcy |
An indicator variable for whether the borrower has a past bankruptcy in her record. This variable takes a value of 1 if the answer is “yes” and 0 if the answer is “no”. |
term |
The length of the loan, in months. |
issued |
The month and year the loan was issued. |
credit_checks |
Number of credit checks in the last 12 months. For example, when filing an application for a credit cards, it is common for the company receiving the application to run a credit check. |
Recall the single variable models we have been studying. The prediction looks like this. \[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\] Where the variables denoted by the \(\hat{\,}\) over them are the estimates obtained by lm.
In every case we have, more or less implicitly, assumed that X is a numeric variable. Would this make any sense if \(X\) is a categorical variable?
Yes
A categorical variable is a variable that can only take upa finite or fixed number of possible values (ex: male and female)
income_ver, what are the possible values of income_ver and what is the proportion of all records in each category of income_ver? Use code to find this, do not use the above dictionary, it might be wrong or out of date.unique(loan$income_ver)
## [1] "verified" "not" "source_only"
loan1 <- loan %>%
select("income_ver")
loan2 <- table(loan1)
prop.table(loan2)
## loan1
## not source_only verified
## 0.3594 0.4116 0.2290
##The possible values are "not", "source_only", and "verified",
##The proportion of the categories are about 0.3594, 0.4116, and 0.2290 respectively
So if we are interested in determining the variation of interest_rate as a function of income_ver. What does lm mean for problems like this?
A linear model to predict the interest_rate as income_ver increases (after the categories are given numerical values)
To understand how to handle categorical variables, we will start with special type of categorical variable called an indicator variable. We have such a variable in the loan data set, it’s called bankruptcy and it takes the value 0 if the applicant has had no previous bankruptcies and 1 if the applicant has had at leat one previous bankruptcy.
bankruptcy1 <- loan %>%
select("bankruptcy")
bankruptcy2 <- table(bankruptcy1)
prop.table(bankruptcy2)
## bankruptcy1
## 0 1
## 0.8785 0.1215
income_ver and bankruptcy?“income_ver” describes whether the borrower’s income source and amount had been verified. While “bankruptcy” indicates whether the borrower has a past bankruptcy in her record. Both variables are categorical, however, “income_ver” is described with characters while “bankruptcy” is described with numerics.
Given a value \(x\) of bankruptcy does it makes sense to multiply \(x\) by a number, e.g., \(2.3*x\)? Hint: consider the possible values of \(x\)
Yes, since the options are either 0 or 1 to represent the two different categories.
Summarize the results of regressing interest_rate on bankruptcy using lm
simple.fit = lm(interest_rate ~ bankruptcy, data = loan)
summary(simple.fit)
##
## Call:
## lm(formula = interest_rate ~ bankruptcy, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7648 -3.6448 -0.4548 2.7120 18.6020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3380 0.0533 231.490 < 2e-16 ***
## bankruptcy 0.7368 0.1529 4.819 1.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.996 on 9998 degrees of freedom
## Multiple R-squared: 0.002317, Adjusted R-squared: 0.002217
## F-statistic: 23.22 on 1 and 9998 DF, p-value: 1.467e-06
Interpret the results of this regression.
The estimated intercept, or the average interest rate of those who have had no bankruptcies is 12.3380. And the average interest rate of those with at least one bankruptcy is \(12.3380 + 0.7368 = 13.0748\). The \(p\)-value for the variable “bankruptcy” is very low, indicating significance. This suggests that there is statistical evidence of a difference in average interest rates between those with 0 and at least one bankruptcy.
avg_bank <- loan %>%
select("interest_rate", "bankruptcy")
avg_bank1 <- avg_bank %>%
filter(bankruptcy == 0) %>%
summarize(Avg_Interest_Rate = mean(interest_rate))
avg_bank1
It is virtually the same as the estimated intercept.
How can you explain this value?
The average interest rate of those who have had 0 bankruptcies is 12.338.
What is the average interest rate for people who have had at least one bankruptcy?
avg_banks <- loan %>%
select("interest_rate", "bankruptcy")
avg_banks1 <- avg_bank %>%
filter(bankruptcy == 1) %>%
summarize(Avg_Interest_Rate = mean(interest_rate))
avg_banks1
Yes, by adding the value 0.7368 from the estimated intercept in the regression summary, the average interest rate for those with at least 1 bankruptcy could have been predicted.
The average interest rate for those with at least one bankruptcy is 0.7368 higher than those without any bankruptcies.
\(R^2\) is more important that the estimated slope because \(R^2\) represents how much of the data can be interpreted by the line of best fit. And since the point of a linear model is to predict and understand the relationship between 2 variables, having a high \(R^2\) value would be important thatn the slope at any time.
interest_rate on income_versimple_fit = lm(interest_rate ~ income_ver, data = loan)
summary(simple_fit)
##
## Call:
## lm(formula = interest_rate ~ income_ver, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0437 -3.7495 -0.6795 2.5345 19.6905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.09946 0.08091 137.18 <2e-16 ***
## income_versource_only 1.41602 0.11074 12.79 <2e-16 ***
## income_ververified 3.25429 0.12970 25.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.851 on 9997 degrees of freedom
## Multiple R-squared: 0.05945, Adjusted R-squared: 0.05926
## F-statistic: 315.9 on 2 and 9997 DF, p-value: < 2.2e-16
3
mean(interest_rate) for each level of the categorical variable income_veravg_ver <- loan %>%
select("interest_rate", "income_ver")
avg_ver1 <- avg_ver %>%
group_by(income_ver) %>%
summarize(Avg_Interest_Rate = mean(interest_rate))
avg_ver1
The intercept is estimated as 11.09946 (which matches the average interest rate for that of “not”). For “source_only”, it is estimated as \(11.09946 + 1.41602 = 12.51548\) (which matches its respective one as well). And lastly, “verified” it is estimated as \(11.09946 + 3.25429 = 14.35375\).
income_ver?There are 3 categories in total (not, source_only, and verified). So the base is “not” and the other 2 are based off of this category.
No
interest_rate on income_verified?\[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 + \hat{\beta_3}X_3\]
Multiple regression means that we are regressing on a sum of variables. In fact when we regress on a categorical variable we are doing multiple regression! Why?
Since the loan data set gives us a lot of variables let’s try regression on all of them. (run the code below)
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan))
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
## bankruptcy + term + issued + credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9070 -3.4362 -0.7239 2.5397 18.0874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.969e+00 2.087e-01 19.016 < 2e-16 ***
## income_versource_only 1.083e+00 1.036e-01 10.456 < 2e-16 ***
## income_ververified 2.482e+00 1.223e-01 20.293 < 2e-16 ***
## debt_to_income 3.787e-02 3.121e-03 12.137 < 2e-16 ***
## credit_util -4.323e-06 8.733e-07 -4.950 7.54e-07 ***
## bankruptcy 5.043e-01 1.383e-01 3.645 0.000269 ***
## term 1.495e-01 4.123e-03 36.257 < 2e-16 ***
## issuedJan2018 -2.061e-02 1.128e-01 -0.183 0.854969
## issuedMar2018 -9.280e-02 1.112e-01 -0.834 0.404037
## credit_checks 2.233e-01 1.917e-02 11.647 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.489 on 9966 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1942, Adjusted R-squared: 0.1934
## F-statistic: 266.8 on 9 and 9966 DF, p-value: < 2.2e-16
Most of them have a \(p\)-value of \(2e^-16\), but there are some who have much larger \(p\)-values.
The adjusted \(R^2\) value is slightly smaller at 0.1934 compared to 0.1942. This indicates that the \(R^2\) has slightly less variation. Both these values seem too small to be a useful model, eliminating the \(p\)-values that are significantly larger may have a positive effect on this.
The multi-variate model we have just constructed is not necessarily the best model.
Why might this be the case?
How might we try to improve things?
Let’s look at the coefficient estimates for the full model (the one with all the variables)
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.969028e+00 2.087232e-01 19.0157518 3.163114e-79
## income_versource_only 1.082767e+00 1.035565e-01 10.4558002 1.866232e-25
## income_ververified 2.482072e+00 1.223104e-01 20.2932296 9.462920e-90
## debt_to_income 3.787448e-02 3.120564e-03 12.1370626 1.160422e-33
## credit_util -4.323176e-06 8.733479e-07 -4.9501194 7.538276e-07
## bankruptcy 5.042764e-01 1.383482e-01 3.6449793 2.687739e-04
## term 1.495000e-01 4.123314e-03 36.2572457 1.696856e-270
## issuedJan2018 -2.060967e-02 1.127529e-01 -0.1827863 8.549694e-01
## issuedMar2018 -9.279620e-02 1.112041e-01 -0.8344677 4.040375e-01
## credit_checks 2.232706e-01 1.916941e-02 11.6472314 3.770754e-31
The largest \(p\)-values are from the “issuedJan2018” and “issuedMar2018”. While the lowest is that of the “term”. The two largest \(p\)-values are the only ones that do not fall under a value of 0.05, therefore failing to match the 95% confidence interval criteria.
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + credit_checks, data = loan))
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
## bankruptcy + term + credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8988 -3.4316 -0.7204 2.5421 18.1083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.928e+00 1.960e-01 20.042 < 2e-16 ***
## income_versource_only 1.080e+00 1.035e-01 10.436 < 2e-16 ***
## income_ververified 2.479e+00 1.223e-01 20.278 < 2e-16 ***
## debt_to_income 3.790e-02 3.120e-03 12.148 < 2e-16 ***
## credit_util -4.319e-06 8.733e-07 -4.946 7.68e-07 ***
## bankruptcy 5.055e-01 1.383e-01 3.654 0.00026 ***
## term 1.495e-01 4.122e-03 36.272 < 2e-16 ***
## credit_checks 2.233e-01 1.917e-02 11.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 9968 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1941, Adjusted R-squared: 0.1935
## F-statistic: 342.9 on 7 and 9968 DF, p-value: < 2.2e-16
Compare this more parsimonious model with the full model. What are your observations?
The \(R^2\) values actually decreased from before even though all the \(p\)-values fall under 0.05. However, the adjusted \(R^2\) increased by 0.0001.
If you were to delete a variable from this model, which one would you delete?
Bankruptcy (and possibly credit_util)
summary(lm(interest_rate ~ income_ver + debt_to_income + term + credit_checks, data = loan))
##
## Call:
## lm(formula = interest_rate ~ income_ver + debt_to_income + term +
## credit_checks, data = loan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.5036 -3.4305 -0.6827 2.5546 17.8843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.91796 0.19552 20.04 <2e-16 ***
## income_versource_only 1.06664 0.10365 10.29 <2e-16 ***
## income_ververified 2.46688 0.12246 20.14 <2e-16 ***
## debt_to_income 0.03440 0.00304 11.32 <2e-16 ***
## term 0.14807 0.00412 35.94 <2e-16 ***
## credit_checks 0.21818 0.01901 11.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.497 on 9970 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.1908, Adjusted R-squared: 0.1904
## F-statistic: 470.2 on 5 and 9970 DF, p-value: < 2.2e-16
What are your observations?
The \(R^2\) value once again decreases slightly. But the adjusted \(R^2\) value is close to that of the \(R^2\) value this time. This maybe because the number of terms within the model decreased, therefore bringing the two closer. In conclusion, removing a variable from a regression will not increase the \(R^2\) because it is eliminating possible variables that are affecting the predicted Y. And therefore, the model will not represent the actual data to its fullest. By examining the equation of \(R^2\), one can see that adding a variable cannot decrease the residual sum of the squares.