library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.2     ✓ dplyr   1.0.7
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1
── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Introduction

This workbook is an in-class assignment. After completing it, you will have a good practical knowledge of multiple regression and logistic regression.

There are two types of exercises in this Tutorial.

Response questions These questions are thought exercises and require a written response.

Response Question Example:

What is the Capital of North Dakota?

Response: Bismarck

Coding questions These questions require entering R code in the provided chunk.

Coding question example

Write the R code to calculate the average of the whole numbers from 1 to 10, enter your code here:

mean(1:10)
[1] 5.5

Section 1 Loans

We will consider data about loans from the peer-to-peer lender, Lending Club. The data are in the loan.csv file in the project directory.

  1. Coding Question 1 Load the data into a new tibble called loan
loan <- read_csv("loan.csv")

── Column specification ─────────────────────────────────────────────────────────────────────
cols(
  interest_rate = col_double(),
  income_ver = col_character(),
  debt_to_income = col_double(),
  credit_util = col_double(),
  bankruptcy = col_double(),
  term = col_double(),
  issued = col_character(),
  credit_checks = col_double()
)

You may find the following variable dictionary useful.

Variable dictionary for loan.

variable description
interest_rate Interest rate for the loan
income_ver Categorical variable describing whether the borrower’s income source and amount have been verified, levels verified,source_only, not
debt_to_income Debt-to-income ratio, which is the percent of total debt of the borrower divided by their total income.
credit_util Of all the Credit available to the borrower, what fraction are they using. For example the credit utilization on the credit card would be the card’s balance divided by the card’s credit limit
bankruptcy An indicator variable for whether the borrower has a past bankruptcy in her record. This variable takes a value of 1 if the answer is “yes” and 0 if the answer is “no”.
term The length of the loan, in months.
issued The month and year the loan was issued.
credit_checks Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check.

Using Categorical Variables as Predictors

Recall the single variable models we have been studying. The prediction looks like this. \[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\] Where the variables denoted by the \(\hat{\,}\) over them are the estimates obtained by lm.

In every case we have, more or less implicitly, assumed that X is a numeric variable. Would this make any sense if \(X\) is a categorical variable?

  1. Response Question What is a categorical variable?

A variable is categorical if its values are labels for groups (categories) rather than measured quantities. The values are not numeric and they are not continuous. If you use the summary function on the data, you will not see a min, max, median, mean, 1st quartile, and 3rd quartile for a categorical variable stored as text.

  1. Coding Question
    Consider the variable income_ver, what are the possible values of income_ver and what is the proportion of all records in each category of income_ver? Use code to find this, do not use the above dictionary, it might be wrong or out of date.

I can see from the summary function that income_ver is a categorical variable because it has no min/max/median/mean output. I can also use the view function to see for myself that income_ver looks like a categorical variable.

view(loan)
summary(loan)
 interest_rate    income_ver        debt_to_income    credit_util       bankruptcy    
 Min.   : 5.31   Length:10000       Min.   :  0.00   Min.   :     0   Min.   :0.0000  
 1st Qu.: 9.43   Class :character   1st Qu.: 11.06   1st Qu.: 19186   1st Qu.:0.0000  
 Median :11.98   Mode  :character   Median : 17.57   Median : 36927   Median :0.0000  
 Mean   :12.43                      Mean   : 19.31   Mean   : 51049   Mean   :0.1215  
 3rd Qu.:15.05                      3rd Qu.: 25.00   3rd Qu.: 65421   3rd Qu.:0.0000  
 Max.   :30.94                      Max.   :469.09   Max.   :942456   Max.   :1.0000  
                                    NA's   :24                                        
      term          issued          credit_checks   
 Min.   :36.00   Length:10000       Min.   : 0.000  
 1st Qu.:36.00   Class :character   1st Qu.: 0.000  
 Median :36.00   Mode  :character   Median : 1.000  
 Mean   :43.27                      Mean   : 1.958  
 3rd Qu.:60.00                      3rd Qu.: 3.000  
 Max.   :60.00                      Max.   :29.000  
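
The question above asks for the possible values and their proportions, which summary() does not report for a character column. A minimal sketch of one way to get them, using a small made-up vector in place of loan$income_ver (the toy values and the names income_ver_toy, levels_found, counts, and props are assumptions for illustration, not the loan data):

```r
# Toy stand-in for loan$income_ver (hypothetical values, not the real data)
income_ver_toy <- c("not", "verified", "not", "source_only", "not",
                    "verified", "source_only", "not")

# Distinct values of the categorical variable
levels_found <- sort(unique(income_ver_toy))

# Counts per level, then each count divided by the total number of records
counts <- table(income_ver_toy)
props  <- prop.table(counts)

print(levels_found)
print(props)
```

With the tidyverse loaded, count(loan, income_ver) followed by mutate(prop = n / sum(n)) should give the same information as a tibble.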
                                                    

Suppose we are interested in how interest_rate varies as a function of income_ver. What does lm mean for problems like this?

To understand how to handle categorical variables, we will start with a special type of categorical variable called an indicator variable. We have such a variable in the loan data set: it’s called bankruptcy, and it takes the value 0 if the applicant has had no previous bankruptcies and 1 if the applicant has had at least one previous bankruptcy.

  1. Coding Question Find the proportion of applicants with no bankruptcy and the proportion of applicants who have had at least one bankruptcy.
count(loan, bankruptcy == 1)
count(loan, bankruptcy == 0)

The bankruptcy variable is 1 if there has been a past bankruptcy and 0 if there hasn’t been one. I can use the count function to find how many times bankruptcy equals 1 (that is, how many times “bankruptcy == 1” is TRUE) and do the same for 0. The number of FALSE results for bankruptcy == 1 equals the number of TRUE results for bankruptcy == 0, which confirms that the bankruptcy column contains only 1s and 0s. Dividing each count by the total number of rows gives the proportions.
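
For a 0/1 indicator, the proportion can also be read off as a mean, since averaging a logical vector gives the share of TRUE values. A short sketch on made-up values (bankruptcy_toy and the two prop_ names are hypothetical, not the real column):

```r
# Toy 0/1 indicator (hypothetical values, not the real bankruptcy column)
bankruptcy_toy <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

# The mean of a logical comparison is the share of rows where it is TRUE
prop_bankrupt    <- mean(bankruptcy_toy == 1)
prop_no_bankrupt <- mean(bankruptcy_toy == 0)

print(c(no_bankruptcy = prop_no_bankrupt, bankruptcy = prop_bankrupt))
```

If the real column holds only 0s and 1s, the Mean of 0.1215 that summary(loan) reports for bankruptcy is this same proportion of applicants with at least one bankruptcy.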

  1. Response Questions What is the difference between the variables income_ver and bankruptcy?

Bankruptcy is an example of a numeric, discrete variable. While its values are numbers, there are only two options (1 and 0). A numeric, continuous variable would not have such restrictions on its values. Income_ver is a categorical variable and does not have numeric values like bankruptcy does.

Given a value \(x\) of bankruptcy, does it make sense to multiply \(x\) by a number, e.g., \(2.3*x\)? Hint: consider the possible values of \(x\)

I do not think it makes sense to modify the bankruptcy value with multiplication because it is simpler and more straightforward to have the 1 represent past bankruptcies and 0 represent no past bankruptcies. Additionally, the multiplication would just change the 1 into 2.3 (or another value) and not change the 0 at all, so multiplication wouldn’t make a significant change.

  1. Coding Question

Summarize the results of regressing interest_rate on bankruptcy using lm

fit_ir <- lm(interest_rate ~ bankruptcy, data = loan)
summary(fit_ir)

Call:
lm(formula = interest_rate ~ bankruptcy, data = loan)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7648 -3.6448 -0.4548  2.7120 18.6020 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.3380     0.0533 231.490  < 2e-16 ***
bankruptcy    0.7368     0.1529   4.819 1.47e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.996 on 9998 degrees of freedom
Multiple R-squared:  0.002317,  Adjusted R-squared:  0.002217 
F-statistic: 23.22 on 1 and 9998 DF,  p-value: 1.467e-06
  1. Response Question

Interpret the results of this regression.

The slope coefficient is 0.7368 which is what we expect the model to adjust the interest rate by if bankruptcy = 1. The intercept 12.338 tells us what the interest rate would be if x (bankruptcy) is 0.

  1. Coding Question What is the average interest rate for applicants who have had no bankruptcies?

The average interest rate for applicants that have had no bankruptcies is 12.338. Comparatively, the mean interest rate for the entire dataset (where bankruptcy = 1 or 0) is 12.43.

ir_b <- filter(loan, bankruptcy == 0)
mean(ir_b$interest_rate)
[1] 12.338
  1. Response Question How does this value compare with the estimated intercept?

The average interest rate for applicants that have had no bankruptcies is 12.338. The intercept is 12.338 too.

How can you explain this value?

The intercept is supposed to be the average value of y if x = 0. Therefore if x is bankruptcy and y is interest rate, this value makes sense.

  1. Coding Question

What is the average interest rate for people who have had at least one bankruptcy?

The average interest rate for people who have had at least one bankruptcy is 13.07.

ir_b1 <- filter(loan, bankruptcy == 1)
mean(ir_b1$interest_rate)
[1] 13.07479
  1. Could you have determined this value from the regression summary? If so, how?

Yes. You can solve for y using x = 1 (at least 1 bankruptcy) and use the slope and intercept. The answer is 13.0748 which rounds to 13.07.
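
With the numbers from the summary above, the calculation is 12.3380 + 0.7368 * 1 = 13.0748. The sketch below checks the same relationship on made-up data: for a 0/1 predictor, the intercept should equal the mean of the x = 0 group, and intercept plus slope should equal the mean of the x = 1 group. The data frame toy and its values are assumptions for illustration only.

```r
# Toy data (hypothetical): y is a response, x is a 0/1 indicator
toy <- data.frame(
  x = c(0, 0, 0, 1, 1, 0, 1, 0),
  y = c(10, 12, 11, 15, 13, 9, 14, 13)
)

fit <- lm(y ~ x, data = toy)
b0 <- unname(coef(fit)["(Intercept)"])
b1 <- unname(coef(fit)["x"])

# Group means computed directly from the data
mean_x0 <- mean(toy$y[toy$x == 0])
mean_x1 <- mean(toy$y[toy$x == 1])

# Intercept vs. the x = 0 mean; intercept + slope vs. the x = 1 mean
print(c(intercept = b0, mean_x0 = mean_x0))
print(c(intercept_plus_slope = b0 + b1, mean_x1 = mean_x1))
```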

  1. How do you interpret the meaning of slope in this context?

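Because x takes only the values 0 and 1, the slope is the difference in predicted (mean) interest rate between applicants with at least one past bankruptcy and applicants with none, 0.7368 here. The intercept gives the mean for the x = 0 group, and the slope is what is added to it when x = 1.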

  1. Which is more important, the estimated slope or \(R^2\)?

I am not sure whether one is more important than the other since one can be used to help find the other. However, R can tell you more about the correlation and the strength of the linear trend, so R^2 may be more useful.

Now for Categorical variables in general.

  1. Coding Question Print the summary of regressing interest_rate on income_ver
fit_iv <- lm(interest_rate ~ income_ver, data = loan)
summary(fit_iv)

Call:
lm(formula = interest_rate ~ income_ver, data = loan)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0437 -3.7495 -0.6795  2.5345 19.6905 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           11.09946    0.08091  137.18   <2e-16 ***
income_versource_only  1.41602    0.11074   12.79   <2e-16 ***
income_ververified     3.25429    0.12970   25.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.851 on 9997 degrees of freedom
Multiple R-squared:  0.05945,   Adjusted R-squared:  0.05926 
F-statistic: 315.9 on 2 and 9997 DF,  p-value: < 2.2e-16
  1. Response Question How many variables does the regression say we have?

It says that we have 2 variables.

  1. Coding Question Calculate mean(interest_rate) for each level of the categorical variable income_ver

Mean interest rate for income_ver == “verified” is 14.35. Mean interest rate for income_ver == “source_only” is 12.52.

income_vv <- filter(loan, income_ver == "verified")
income_vs <- filter(loan, income_ver == "source_only")
mean(income_vv$interest_rate)
[1] 14.35375
mean(income_vs$interest_rate)
[1] 12.51548
  1. Response Question
  1. How do your values correspond to the coefficient values in the regression?

The mean interest rate for income_ver == “verified” is 14.35, and the coefficient for income_ververified is 3.25429. The mean interest rate for income_ver == “source_only” is 12.52, and the coefficient for income_versource_only is 1.41602. The intercept is 11.09946. If you think of income_ver == “verified” as x = 1 (1 = TRUE), then y = intercept + coefficient * x gives 11.09946 + 3.25429 = 14.35375, which rounds to 14.35 (the mean interest rate for income_ver == “verified”). Doing the same for income_ver == “source_only” with its coefficient gives 11.09946 + 1.41602 = 12.51548, which rounds to 12.52 (the mean interest rate for income_ver == “source_only”). The intercept itself is the mean interest rate for the remaining level, “not”.

This clear relationship between the mean interest rates calculated with the summary and filter functions and the coefficients calculated with the lm function shows us how the equation parts fit together.

  1. Given what you know now why do you think you have two new variables related to income_ver?

Bankruptcy had only 2 values: 1 and 0. Comparatively, income_ver has 3 values: verified, not, & source_only. R made “not” (the first level alphabetically) the reference level and created one 0/1 indicator variable for each of the other two levels, “source_only” and “verified”.
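
The indicator columns can be inspected directly with model.matrix(), which builds the design matrix that lm() fits. A minimal sketch on a made-up five-row data frame (toy and X are hypothetical names, and the values are not the loan data):

```r
# Toy three-level categorical variable (hypothetical values), stored as a factor.
# factor() sorts levels alphabetically, so "not" is first and is the reference.
toy <- data.frame(
  income_ver = factor(c("not", "source_only", "verified", "not", "verified"))
)
print(levels(toy$income_ver))

# Design matrix: an intercept column plus one 0/1 column
# for each non-reference level
X <- model.matrix(~ income_ver, data = toy)
print(X)
```

Rows where income_ver is "not" have 0 in both indicator columns, so their prediction is the intercept alone. To use a different reference level, relevel() with its ref argument is one option.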

  1. Do you get an extra variable in case of single variable regression? If so what is it?

You do not get an extra variable in case of single variable regression.

  1. How would you write a regression model for the regression of interest_rate on income_ver?

yhat = 11.09946 + (1.41602 * x_1) + (3.25429 * x_2)

In the above equation, x_1 refers to income_ver == “source_only” and x_2 refers to income_ver == “verified”.
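
Written out level by level with the estimates from the summary above, the model reduces to three predicted values, one per level of income_ver. This is just the equation above evaluated at the three possible combinations of x_1 and x_2:

```latex
\[
\hat{Y} =
\begin{cases}
11.09946 & \text{if income\_ver = not } (x_1 = 0,\ x_2 = 0) \\
11.09946 + 1.41602 = 12.51548 & \text{if income\_ver = source\_only } (x_1 = 1,\ x_2 = 0) \\
11.09946 + 3.25429 = 14.35375 & \text{if income\_ver = verified } (x_1 = 0,\ x_2 = 1)
\end{cases}
\]
```

The second and third values match the group means computed with filter and mean earlier.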

Multiple Regression

Multiple regression means that we are regressing on a sum of variables. In fact when we regress on a categorical variable we are doing multiple regression! Why?

Since the loan data set gives us a lot of variables let’s try regression on all of them. (run the code below)

summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan))

Call:
lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util + 
    bankruptcy + term + issued + credit_checks, data = loan)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.9070  -3.4362  -0.7239   2.5397  18.0874 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.969e+00  2.087e-01  19.016  < 2e-16 ***
income_versource_only  1.083e+00  1.036e-01  10.456  < 2e-16 ***
income_ververified     2.482e+00  1.223e-01  20.293  < 2e-16 ***
debt_to_income         3.787e-02  3.121e-03  12.137  < 2e-16 ***
credit_util           -4.323e-06  8.733e-07  -4.950 7.54e-07 ***
bankruptcy             5.043e-01  1.383e-01   3.645 0.000269 ***
term                   1.495e-01  4.123e-03  36.257  < 2e-16 ***
issuedJan2018         -2.061e-02  1.128e-01  -0.183 0.854969    
issuedMar2018         -9.280e-02  1.112e-01  -0.834 0.404037    
credit_checks          2.233e-01  1.917e-02  11.647  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.489 on 9966 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.1942,    Adjusted R-squared:  0.1934 
F-statistic: 266.8 on 9 and 9966 DF,  p-value: < 2.2e-16
  1. Response Question
  1. Comment on the \(p\)-values for the variables, noting anything interesting.

Most of the p-values are extremely small (below 0.001), but two are large: issuedJan2018 (0.85) and issuedMar2018 (0.40). Those two are the only coefficients that are not statistically significant. That suggests the month/year the loan was issued adds little to the model once the other variables are included.

  1. Comment on the difference between \(R^2\) and adj-\(R^2\). Which do you think is more variable? Do you think the value of either \(R^2\) is too low to make this model useful?

The adjusted R^2 is more variable because it changes as you remove or add variables.
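
For reference, the usual formula behind the adjusted value is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations used and p the number of predictors excluding the intercept. The sketch below implements it in a helper (adj_r2 is a hypothetical name) and compares it with what summary() reports on simulated toy data; none of the values come from the loan data.

```r
# Adjusted R-squared from R-squared, sample size n, and number of
# predictors p (not counting the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Check against lm() on simulated toy data (hypothetical values)
set.seed(1)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 1 + 2 * toy$x1 + rnorm(30)

s <- summary(lm(y ~ x1 + x2, data = toy))
by_hand <- adj_r2(s$r.squared, n = nrow(toy), p = 2)

print(c(by_hand = by_hand, from_lm = s$adj.r.squared))
```

Plain R^2 never decreases when a predictor is added, while the adjusted version can go down because of the penalty on p, which is why it is the one compared across models here.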

  1. How would you suggest making the model better?

We can take out issued since it made variables that weren’t very statistically significant. You could also take out bankruptcy since, after eliminating issued, it is the least statistically significant. However, doing that made the adjusted R^2 go down so adding it back in is probably best.

summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + term + credit_checks, data = loan))

Call:
lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util + 
    term + credit_checks, data = loan)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.0667  -3.4379  -0.6904   2.5357  18.0371 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.980e+00  1.956e-01  20.344  < 2e-16 ***
income_versource_only  1.084e+00  1.036e-01  10.465  < 2e-16 ***
income_ververified     2.481e+00  1.223e-01  20.279  < 2e-16 ***
debt_to_income         3.818e-02  3.121e-03  12.234  < 2e-16 ***
credit_util           -4.554e-06  8.714e-07  -5.226 1.76e-07 ***
term                   1.496e-01  4.125e-03  36.265  < 2e-16 ***
credit_checks          2.295e-01  1.911e-02  12.010  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.491 on 9969 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.193, Adjusted R-squared:  0.1925 
F-statistic: 397.4 on 6 and 9969 DF,  p-value: < 2.2e-16
summary(lm(interest_rate ~ income_ver + debt_to_income + term + credit_checks, data = loan))

Call:
lm(formula = interest_rate ~ income_ver + debt_to_income + term + 
    credit_checks, data = loan)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.5036  -3.4305  -0.6827   2.5546  17.8843 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.91796    0.19552   20.04   <2e-16 ***
income_versource_only  1.06664    0.10365   10.29   <2e-16 ***
income_ververified     2.46688    0.12246   20.14   <2e-16 ***
debt_to_income         0.03440    0.00304   11.32   <2e-16 ***
term                   0.14807    0.00412   35.94   <2e-16 ***
credit_checks          0.21818    0.01901   11.48   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.497 on 9970 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.1908,    Adjusted R-squared:  0.1904 
F-statistic: 470.2 on 5 and 9970 DF,  p-value: < 2.2e-16

Model Selection

The multi-variate model we have just constructed is not necessarily the best model.

Why might this be the case?

How might we try to improve things?

Identifying variables that might not be important.

Let’s look at the coefficient estimates for the full model (the one with all the variables)

coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan)))
                           Estimate   Std. Error    t value      Pr(>|t|)
(Intercept)            3.969028e+00 2.087232e-01 19.0157518  3.163114e-79
income_versource_only  1.082767e+00 1.035565e-01 10.4558002  1.866232e-25
income_ververified     2.482072e+00 1.223104e-01 20.2932296  9.462920e-90
debt_to_income         3.787448e-02 3.120564e-03 12.1370626  1.160422e-33
credit_util           -4.323176e-06 8.733479e-07 -4.9501194  7.538276e-07
bankruptcy             5.042764e-01 1.383482e-01  3.6449793  2.687739e-04
term                   1.495000e-01 4.123314e-03 36.2572457 1.696856e-270
issuedJan2018         -2.060967e-02 1.127529e-01 -0.1827863  8.549694e-01
issuedMar2018         -9.279620e-02 1.112041e-01 -0.8344677  4.040375e-01
credit_checks          2.232706e-01 1.916941e-02 11.6472314  3.770754e-31
  1. Response Question Look at the \(p\)-values. What are the largest and what are the smallest? Are any of the variable estimates not significant at the 95% level?

The largest p-values belong to “issuedJan2018” & “issuedMar2018”. These 2 came from the loan dataset’s “issued” column that noted whether the loans were issued in January, February, or March 2018. The variables are categorical. Earlier we suggested taking out the “issued” variables because they were not statistically significant and that is still true here, as evidenced by the p-values being larger and closer to 1.

Comparatively, the “term” variable has an extremely small p-value. This is a numeric and discrete variable since the values can only be 36 and 60. Since its p-value is the smallest, it gives the strongest evidence of a nonzero coefficient, meaning that the value of term (length of loan in months) is clearly related to the interest rate.

Thinking about it, the idea that the length of the loan has a large impact on the interest rate makes more sense than the idea that the issue date has a large impact on the interest rate. This logic confirms my thinking on the p-values.
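
The p-values can also be pulled out of the coefficient table programmatically rather than read by eye. A minimal sketch on simulated data, where x1 drives y and noise is unrelated by construction (toy, coefs, p_values, and not_sig are hypothetical names, and the data are not the loan data):

```r
# Simulated toy data (hypothetical): y depends on x1 but not on noise
set.seed(42)
toy <- data.frame(x1 = rnorm(200), noise = rnorm(200))
toy$y <- 3 + 2 * toy$x1 + rnorm(200)

# coef(summary(...)) is a numeric matrix; the p-values are one named column
coefs <- coef(summary(lm(y ~ x1 + noise, data = toy)))
p_values <- coefs[, "Pr(>|t|)"]

# Terms ordered from largest to smallest p-value
print(sort(p_values, decreasing = TRUE))

# Terms other than the intercept that are not significant at the 5% level
not_sig <- names(p_values)[p_values > 0.05 & names(p_values) != "(Intercept)"]
print(not_sig)
```

A categorical predictor such as issued contributes one row per non-reference level, so it is the whole variable, not a single level, that would be dropped from the formula.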

  1. Coding Question Rerun the regression leaving out the non-significant variables.
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + credit_checks, data = loan)))
                           Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)            3.928358e+00 1.960052e-01 20.042113  1.241835e-87
income_versource_only  1.080223e+00 1.035067e-01 10.436255  2.287914e-25
income_ververified     2.479061e+00 1.222547e-01 20.277831  1.277232e-89
debt_to_income         3.790429e-02 3.120197e-03 12.148042  1.016557e-33
credit_util           -4.319428e-06 8.732526e-07 -4.946367  7.684602e-07
bankruptcy             5.054572e-01 1.383325e-01  3.653931  2.595819e-04
term                   1.495189e-01 4.122211e-03 36.271519 1.065862e-270
credit_checks          2.233441e-01 1.916791e-02 11.651979  3.568771e-31
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + credit_checks, data = loan))

Call:
lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util + 
    bankruptcy + term + credit_checks, data = loan)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8988  -3.4316  -0.7204   2.5421  18.1083 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.928e+00  1.960e-01  20.042  < 2e-16 ***
income_versource_only  1.080e+00  1.035e-01  10.436  < 2e-16 ***
income_ververified     2.479e+00  1.223e-01  20.278  < 2e-16 ***
debt_to_income         3.790e-02  3.120e-03  12.148  < 2e-16 ***
credit_util           -4.319e-06  8.733e-07  -4.946 7.68e-07 ***
bankruptcy             5.055e-01  1.383e-01   3.654  0.00026 ***
term                   1.495e-01  4.122e-03  36.272  < 2e-16 ***
credit_checks          2.233e-01  1.917e-02  11.652  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.488 on 9968 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.1941,    Adjusted R-squared:  0.1935 
F-statistic: 342.9 on 7 and 9968 DF,  p-value: < 2.2e-16
  1. Response Questions

Compare this more parsimonious model with the full model. What are your observations?

While the p-values still have a large range, they are all very small and statistically significant. Now “bankruptcy” has the largest p-value. However, in question 3 I found that eliminating “bankruptcy” made the adjusted R^2 go down so I opted not to remove it here.

If you were to delete a variable from this model, which one would you delete?

If I had to delete a variable it would be “bankruptcy”. However, even if I hadn’t already tried deleting “bankruptcy” (as I explained in the previous question), I still wouldn’t feel completely comfortable deleting it since it seems significant.

  1. Coding Question Estimate the model you just proposed.
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + term + credit_checks, data = loan)))
                           Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)            3.979778e+00 1.956204e-01 20.344393  3.474765e-90
income_versource_only  1.083822e+00 1.035661e-01 10.465020  1.694816e-25
income_ververified     2.480773e+00 1.223296e-01 20.279428  1.237520e-89
debt_to_income         3.818449e-02 3.121187e-03 12.233967  3.596550e-34
credit_util           -4.554409e-06 8.714208e-07 -5.226418  1.763213e-07
term                   1.495828e-01 4.124727e-03 36.264891 1.313362e-270
credit_checks          2.294695e-01 1.910629e-02 12.010160  5.306514e-33
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + term + credit_checks, data = loan))

Call:
lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util + 
    term + credit_checks, data = loan)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.0667  -3.4379  -0.6904   2.5357  18.0371 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.980e+00  1.956e-01  20.344  < 2e-16 ***
income_versource_only  1.084e+00  1.036e-01  10.465  < 2e-16 ***
income_ververified     2.481e+00  1.223e-01  20.279  < 2e-16 ***
debt_to_income         3.818e-02  3.121e-03  12.234  < 2e-16 ***
credit_util           -4.554e-06  8.714e-07  -5.226 1.76e-07 ***
term                   1.496e-01  4.125e-03  36.265  < 2e-16 ***
credit_checks          2.295e-01  1.911e-02  12.010  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.491 on 9969 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.193, Adjusted R-squared:  0.1925 
F-statistic: 397.4 on 6 and 9969 DF,  p-value: < 2.2e-16
  1. Response Question

What are your observations?

As I said earlier, the adjusted R^2 did go down (from 0.1935 to 0.1925), which is not ideal. You can go round and round eliminating the variable with the highest p-value each time, but that can mean eliminating variables with statistical significance. While it doesn’t make sense that previous bankruptcies would have the most impact on interest rate, it makes sense that previous bankruptcies would have some impact on interest rate and more of an impact than the issue date. Therefore, the model is less useful when we take bankruptcy out.
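
Rather than retyping the full formula for each candidate model, update() can refit a model with one term removed, and the adjusted R^2 values can be collected side by side. A sketch on simulated data under assumed names (toy, full, reduced, comparison), not the loan data:

```r
# Simulated toy data (hypothetical): x3 is unrelated to y by construction
set.seed(7)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2 + rnorm(100)

full <- lm(y ~ x1 + x2 + x3, data = toy)

# update() refits the same model with the x3 term dropped
reduced <- update(full, . ~ . - x3)

# Collect adjusted R-squared for both fits in one table
comparison <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
)
print(comparison)
```

The same pattern would apply to the loan models, for example dropping bankruptcy from the model without issued and comparing the two adjusted R^2 values.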

---
title: "Tutorial Part 1"
author: "Tara Bhat"
date: "August 5, 2021"
output: html_notebook
---

```{r}
library(tidyverse)
library(openintro)
```

## Introduction

This workbook is an in-class assignment. After completing it, you will have a good practical knowledge of multiple regression and logistic regression.

There are two types of exercises in this Tutorial.

*Response questions* These questions are thought exercises and require a written response.

**Response Question Example:**

What is the Capital of North Dakota?

Response: Bismarck


*Coding questions* These questions require entering R code in the provided chunk.

**Coding question example**

Write the R code to calculate the average of the whole numbers from 1 to 10,
enter your code here:

```{r}
mean(1:10)
```


## Section 1 Loans

We will consider data about loans from the peer-to-peer lender, Lending Club. The data are in the `loan.csv` file in the project directory.


(@) **Coding Question 1**
Load the data into a new tibble called `loan`

```{r}
loan <- read_csv("loan.csv")
```


You may find the following variable dictionary useful.

Variable dictionary for `loan`.
===========================================
  **variable**        **description**
--------------        ---------------------
`interest_rate`       Interest rate for the loan
`income_ver`          Categorical variable describing whether the borrower's income source and amount have been verified, levels `verified`,`source_only`, `not`
`debt_to_income`      Debt-to-income ratio, which is the percent of total debt of the borrower divided by their total income.
`credit_util`           Of all the Credit available to the borrower, what fraction are they using. For example the credit utilization on the credit card would be the card's balance divided by the card's credit limit
`bankruptcy`           An indicator variable for whether the borrower has a past bankruptcy in her record. This variable takes a value of 1 if the answer is "yes" and 0 if the answer is "no".
`term`                  The length of the loan, in months.
`issued`                The month and year the loan was issued.
`credit_checks`         Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check.
--------------------------------------------


## Using Categorical Variables as Predictors

Recall the single variable models we have been studying. The prediction looks like this.
$$ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X$$
Where the variables denoted by the $\hat{\,}$ over them are the estimates obtained by `lm`.

In every case we have, more or less implicitly, assumed that X is a numeric variable. Would this make any sense if $X$ is a categorical variable?

(@) **Response Question**
What is a categorical variable?

A variable is categorical if its values are labels for groups (categories) rather than measured quantities. The values are not numeric and they are not continuous. If you use the summary function on the data, you will not see a min, max, median, mean, 1st quartile, and 3rd quartile for a categorical variable stored as text.


(@) **Coding Question**  
Consider the variable `income_ver`, what are the possible values of `income_ver` and what is the proportion of all records in each category of `income_ver`? Use code to find this, do not use the above dictionary, it might be wrong or out of date.



I can see from the summary function that income_ver is a categorical variable because it has no min/max/median/mean output. I can also use the view function to see for myself that income_ver looks like a categorical variable.



```{r Question 3 code}
view(loan)
summary(loan)
```


Suppose we are interested in how `interest_rate` varies as a function of `income_ver`. What does `lm` mean for problems like this?

To understand how to handle categorical variables, we will start with a special type of categorical variable called an indicator variable. We have such a variable in the `loan` data set: it's called `bankruptcy`, and it takes the value 0 if the applicant has had no previous bankruptcies and 1 if the applicant has had at least one previous bankruptcy.

(@) **Coding Question**
Find the proportion of applicants with no bankruptcy and the proportion of applicants who have had at least one bankruptcy.


```{r Question 4 code}
count(loan, bankruptcy == 1)
count(loan, bankruptcy == 0)
```

The bankruptcy variable is 1 if there has been a past bankruptcy and 0 if there hasn't been one. I can use the count function to find how many times bankruptcy equals 1 (that is, how many times "bankruptcy == 1" is TRUE) and do the same for 0. The number of FALSE results for bankruptcy == 1 equals the number of TRUE results for bankruptcy == 0, which confirms that the bankruptcy column contains only 1s and 0s. Dividing each count by the total number of rows gives the proportions.


(@) **Response Questions**
What is the difference between the variables `income_ver` and `bankruptcy`?


Bankruptcy is an example of a numeric, discrete variable. While its values are numbers, there are only two options (1 and 0). A numeric, continuous variable would not have such restrictions on its values. Income_ver is a categorical variable and does not have numeric values like bankruptcy does.



Given a value $x$ of `bankruptcy`, does it make sense to multiply $x$ by a number, e.g., $2.3*x$? *Hint: consider the possible values of $x$*


I do not think it makes sense to modify the bankruptcy value with multiplication because it is simpler and more straightforward to have the 1 represent past bankruptcies and 0 represent no past bankruptcies. Additionally, the multiplication would just change the 1 into 2.3 (or another value) and not change the 0 at all, so multiplication wouldn't make a significant change.

(@) **Coding Question**

Summarize the results of regressing `interest_rate` on `bankruptcy` using `lm` 


```{r Question 6 code}
fit_ir <- lm(interest_rate ~ bankruptcy, data = loan)
summary(fit_ir)
```

(@) **Response Question**

Interpret the results of this regression.


The slope coefficient is 0.7368 which is what we expect the model to adjust the interest rate by if bankruptcy = 1. The intercept 12.338 tells us what the interest rate would be if x (bankruptcy) is 0.


(@) **Coding Question**
What is the average interest rate for applicants who have had no bankruptcies?


The average interest rate for applicants who have had no bankruptcies is 12.338. For comparison, the mean interest rate for the entire dataset (where bankruptcy = 1 or 0) is 12.43.

```{r Question 8 code}
ir_b <- filter(loan, bankruptcy == 0)
mean(ir_b$interest_rate)
```

(@) **Response Question**
How does this value compare with the estimated intercept?


The average interest rate for applicants that have had no bankruptcies is 12.338. The intercept is 12.338 too. 


How can you explain this value?


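The intercept is the model's prediction for y when x = 0. With a single 0/1 predictor, the least-squares prediction for each group is that group's mean. Therefore, if x is bankruptcy and y is interest rate, the intercept equals the average interest rate for applicants with no bankruptcies.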
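
The relationship between the intercept and the group mean can be seen on a tiny made-up data set. This is only a sketch: the data frame `toy` and its values are hypothetical and are not taken from `loan`.

```r
# Hypothetical data: three applicants without a bankruptcy, two with one
toy <- data.frame(
  bankruptcy = c(0, 0, 0, 1, 1),
  interest_rate = c(10, 12, 14, 13, 15)
)

fit_toy <- lm(interest_rate ~ bankruptcy, data = toy)

# Group means computed directly
mean_no_bk <- mean(toy$interest_rate[toy$bankruptcy == 0])
mean_bk <- mean(toy$interest_rate[toy$bankruptcy == 1])

# Intercept matches mean_no_bk; slope matches mean_bk - mean_no_bk
coef(fit_toy)
c(mean_no_bk, mean_bk - mean_no_bk)
```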

(@) **Coding Question**

What is the average interest rate for people who have had at least one bankruptcy?


The average interest rate for people who have had at least one bankruptcy is 13.07.

```{r}
ir_b1 <- filter(loan, bankruptcy == 1)
mean(ir_b1$interest_rate)
```


1. Could you have determined this value from the regression summary? If so, how?


Yes. Plug x = 1 (at least 1 bankruptcy) into the fitted line using the slope and intercept: 12.338 + 0.7368(1) = 13.0748, which rounds to 13.07.
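
The same calculation can be written out in R. The first part uses the intercept and slope reported in the regression summary; the second part is a sketch showing that `predict()` does the equivalent work on a fitted model, using a made-up data frame `toy` rather than `loan`.

```r
# Estimates as reported in the regression summary
intercept <- 12.338
slope <- 0.7368
pred_bankrupt <- intercept + slope * 1
pred_bankrupt

# Hypothetical toy data to show the same idea with predict()
toy <- data.frame(
  bankruptcy = c(0, 0, 0, 1, 1),
  interest_rate = c(10, 12, 14, 13, 15)
)
fit_toy <- lm(interest_rate ~ bankruptcy, data = toy)
pred_toy <- predict(fit_toy, newdata = data.frame(bankruptcy = 1))
pred_toy
```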


1. How do you interpret the meaning of slope in this context?


The slope is the difference between the two group means. Applicants with at least one bankruptcy have an average interest rate that is 0.7368 higher than applicants with none (13.0748 versus 12.338).


1. Which is more important, the estimated slope or $R^2$?


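I am not sure whether one is more important than the other, since they answer different questions. The slope tells you how much the predicted interest rate changes with x, while $R^2$ tells you how much of the variation in interest rate the model explains, and so how strong the linear trend is. For judging how useful the model is, $R^2$ may be more useful.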
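
A short sketch can show that the slope and $R^2$ measure different things. In a regression with one predictor, $R^2$ equals the squared correlation, and rescaling the response changes the slope but not $R^2$. The data frame `toy` is made up for illustration.

```r
# Hypothetical toy data, not the loan data
toy <- data.frame(
  bankruptcy = c(0, 0, 0, 1, 1),
  interest_rate = c(10, 12, 14, 13, 15)
)

fit_toy <- lm(interest_rate ~ bankruptcy, data = toy)
r2_toy <- summary(fit_toy)$r.squared

# With one predictor, R^2 is the squared correlation
r2_from_cor <- cor(toy$bankruptcy, toy$interest_rate)^2

# Rescaling the response multiplies the slope by 100 but leaves R^2 alone
fit_scaled <- lm(I(interest_rate * 100) ~ bankruptcy, data = toy)
r2_scaled <- summary(fit_scaled)$r.squared

c(slope = unname(coef(fit_toy)[2]), slope_scaled = unname(coef(fit_scaled)[2]))
c(r2 = r2_toy, r2_scaled = r2_scaled)
```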


#### Now for Categorical variables in general.

(@) **Coding Question** 
Print the summary of regressing `interest_rate` on `income_ver`
```{r}
fit_iv <- lm(interest_rate ~ income_ver, data = loan)
summary(fit_iv)
```

(@) **Response Question**
How many variables does the regression say we have?


It says that we have 2 variables in addition to the intercept: one for "source_only" and one for "verified".

(@) **Coding Question**
Calculate  `mean(interest_rate)` for each level of the categorical variable `income_ver`


Mean interest rate for income_ver == "verified" is 14.35. Mean interest rate for income_ver == "source_only" is 12.52.
```{r}
income_vv <- filter(loan, income_ver == "verified")
income_vs <- filter(loan, income_ver == "source_only")
mean(income_vv$interest_rate)
mean(income_vs$interest_rate)
```
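
All of the group means can also be computed in one step instead of filtering once per level. The sketch below uses base R's `tapply()` on a made-up data frame; `toy_iv` and its values are hypothetical.

```r
# Hypothetical data with the same three levels as income_ver
toy_iv <- data.frame(
  income_ver = c("not", "not", "source_only", "source_only",
                 "verified", "verified"),
  interest_rate = c(10, 12, 12, 14, 14, 16)
)

# Mean interest rate for every level at once
group_means <- tapply(toy_iv$interest_rate, toy_iv$income_ver, mean)
group_means
```

This also reports the mean for the "not" level, which is useful when comparing against the regression intercept.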

(@) **Response Question**
1. How do your values correspond to the coefficient values in the regression?


The intercept is 11.09946, the coefficient for "verified" is 3.25429, and the coefficient for "source_only" is 1.41602. If you think of income_ver == "verified" as x = 1 (1 = TRUE), then adding the "verified" coefficient to the intercept gives 11.09946 + 3.25429 = 14.35375, which rounds to 14.35 (mean interest rate for income_ver == "verified"). Doing the same with the "source_only" coefficient gives 11.09946 + 1.41602 = 12.51548, which rounds to 12.52 (mean interest rate for income_ver == "source_only"). The intercept by itself is the predicted interest rate for the remaining level, "not".

This clear relationship between the mean interest rates calculated with the mean and filter functions and the coefficients calculated with the lm function shows us how the parts of the equation fit together.


2. Given what you know now why do you think you have two new variables related to `income_ver`?


Bankruptcy had only 2 responses: 1 and 0. By comparison, income_ver has 3 levels: "verified", "not", and "source_only". R made "not" the reference level, which is absorbed into the intercept, and created one 0/1 indicator variable for each of the other two levels, "source_only" and "verified".
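
The indicator variables that R builds behind the scenes can be inspected with `model.matrix()`. This is a minimal sketch on made-up data (`toy_iv` is hypothetical); it shows the reference level being absorbed into the intercept and one 0/1 column created for each remaining level.

```r
# Hypothetical data with the same three levels as income_ver
toy_iv <- data.frame(
  income_ver = c("not", "not", "source_only", "source_only",
                 "verified", "verified"),
  interest_rate = c(10, 12, 12, 14, 14, 16)
)

# Design matrix: an intercept column plus one 0/1 column per non-reference level
design <- model.matrix(~ income_ver, data = toy_iv)
design

# Each coefficient is the difference from the reference level's mean
fit_toy_iv <- lm(interest_rate ~ income_ver, data = toy_iv)
coef(fit_toy_iv)
```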


3. Do you get an extra variable in case of single variable regression? If so what is it?


You do not get an extra variable in the case of single-variable regression.


4. How would you write a regression model for the regression of `interest_rate` on `income_ver`? 


$\hat{y} = 11.09946 + 1.41602\,x_1 + 3.25429\,x_2$

In the above equation, $x_1$ is 1 when income_ver == "source_only" and 0 otherwise, and $x_2$ is 1 when income_ver == "verified" and 0 otherwise.
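
The equation can be turned into a small function to see the prediction for each level. This is a sketch only: `predict_rate` is a hypothetical helper written for illustration, and the coefficients are the ones reported in the regression summary.

```r
# Coefficients as reported in the regression summary
b0 <- 11.09946
b_source <- 1.41602
b_verified <- 3.25429

# Hypothetical helper: builds the two indicators and applies the equation
predict_rate <- function(level) {
  x1 <- as.numeric(level == "source_only")
  x2 <- as.numeric(level == "verified")
  b0 + b_source * x1 + b_verified * x2
}

preds <- predict_rate(c("not", "source_only", "verified"))
preds
```

For the reference level both indicators are 0, so the prediction is just the intercept.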


### Multiple Regression

Multiple regression means that we are regressing on a sum of variables. In fact when we regress on a categorical variable we are doing multiple regression! Why?

Since the loan data set gives us a lot of variables let's try regression on all of them. (run the code below)
```{r}
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan))
```
(@) **Response Question**

1. Comment on the $p$-values for the variables, noting anything interesting.


The p-values vary widely: 2 are large (pretty close to 1) and the rest are very small (close to 0). The 2 large p-values are the only ones that are not statistically significant, and both belong to levels of `issued`. That means that the month/year the loan was issued does not have a statistically detectable effect on interest rate in this model. 


2. Comment on the difference between $R^2$ and adj-$R^2$. Which do you think is more variable? Do you think the value of either $R^2$ is too low to make this model useful?


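The adjusted R^2 is more variable. R^2 can only go up or stay the same when you add a variable, while the adjusted R^2 includes a penalty for the number of variables, so it can go up or down as you remove or add variables. 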
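
The difference between the two measures can be explored with simulated data. In this sketch, all variables are made up: `noise` is generated independently of `y`, so it should not really help the model. The last line recomputes the adjusted $R^2$ from its definition, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $p$ is the number of predictors.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
noise <- rnorm(n)            # unrelated to y by construction
y <- 2 + 3 * x + rnorm(n)

s_small <- summary(lm(y ~ x))
s_big <- summary(lm(y ~ x + noise))

# R^2 never decreases when a predictor is added
c(small = s_small$r.squared, big = s_big$r.squared)

# Adjusted R^2 carries a penalty for the extra predictor
c(small = s_small$adj.r.squared, big = s_big$adj.r.squared)

# Adjusted R^2 for the bigger model from its definition (p = 2 predictors)
adj_manual <- 1 - (1 - s_big$r.squared) * (n - 1) / (n - 2 - 1)
```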


3. How would you suggest making the model better?


We can take out `issued`, since its levels were not statistically significant. You could also take out `bankruptcy` since, after eliminating `issued`, it is the least statistically significant variable. However, doing that made the adjusted R^2 go down, so adding it back in is probably best. 

```{r}
# Model without issued and bankruptcy
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + term + credit_checks, data = loan))
# Same model with credit_util also removed
summary(lm(interest_rate ~ income_ver + debt_to_income + term + credit_checks, data = loan))
```


### Model Selection

The multi-variate model we have just constructed is not necessarily the best model. 

Why might this be the case?

How might we try to improve things?

#### Identifying variables that might not be important.
Let's look at the coefficient estimates for the **full** model (the one with all the variables)
```{r}
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + issued + credit_checks, data = loan)))
```
(@) **Response Question**
Look at the $p$-values. What are the largest and what are the smallest? Are any of the variable estimates not significant at the 95% level?


The largest p-values belong to "issuedJan2018" and "issuedMar2018". These 2 come from the loan dataset's categorical "issued" column, which notes whether the loans were issued in January, February, or March 2018. Earlier we suggested taking out "issued" because it was not statistically significant, and that is still true here: both p-values are close to 1 and well above 0.05, so neither estimate is significant at the 95% level. 

By comparison, the "term" variable has an extremely small p-value. This is a numeric and discrete variable, since its values can only be 36 and 60. Because its p-value is the smallest, we have the strongest evidence that term (length of loan in months) is related to the interest rate. A small p-value by itself does not tell us how large the effect is; the coefficient estimate does that. 

Thinking about it, the idea that the length of the loan affects the interest rate makes more sense than the idea that the issue date does. This logic confirms my thinking on the p-values.
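
Rather than reading the p-values off the printed table, they can be pulled out of the coefficient matrix and sorted. The sketch below uses R's built-in `mtcars` data as a stand-in, since it is always available; the model `mpg ~ wt + qsec` is only an example and has nothing to do with the loan data.

```r
# Stand-in model on a built-in data set (not the loan data)
fit_cars <- lm(mpg ~ wt + qsec, data = mtcars)

# coef(summary(...)) is a matrix; the p-values are in the "Pr(>|t|)" column
p_values <- coef(summary(fit_cars))[, "Pr(>|t|)"]

# Largest p-values first
sort(p_values, decreasing = TRUE)

# Terms that are not significant at the 95% level (p > 0.05), if any
not_significant <- names(p_values)[p_values > 0.05]
not_significant
```

For a categorical variable with several levels, such as `issued`, each level gets its own row and p-value, so the rows should be considered together when deciding whether to drop the variable.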


(@) **Coding Question**
Rerun the regression leaving out the non-significant variables.
```{r}
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + credit_checks, data = loan)))
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + bankruptcy + term + credit_checks, data = loan))
```
(@) **Response Questions**

Compare this more parsimonious model with the full model. What are your observations?


While the p-values still have a large range, they are all small and seem statistically significant. Now "bankruptcy" has the largest p-value. However, in question 3 I found that eliminating "bankruptcy" made the adjusted R^2 go down, so I opted not to remove it here.


If you were to delete a variable from this model, which one would you delete?


If I had to delete a variable, it would be "bankruptcy". However, even if I had not already tried deleting "bankruptcy" (as I explained in the previous question), I still would not feel completely comfortable deleting it, since it seems significant. 


(@) **Coding Question**
Estimate the model you just proposed.

```{r}
coef(summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + term + credit_checks, data = loan)))
summary(lm(interest_rate ~ income_ver + debt_to_income + credit_util + term + credit_checks, data = loan))
```

(@) **Response Question**

What are your observations?


As I said earlier, the adjusted R^2 did go down, which is not ideal. You can go round and round eliminating the variable with the highest p-value each time, but that can mean eliminating variables with statistical significance. While it doesn't make sense that previous bankruptcies would have the most impact on interest rate, it makes sense that they would have some impact, and more of an impact than the issue date. Therefore, the model is less useful when we take bankruptcy out. 
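
One elimination step can be written compactly with `update()`, which refits a model with a term removed, followed by a comparison of the adjusted $R^2$ values. This is a sketch on the built-in `mtcars` data as a stand-in; the model and the choice of `drat` as the dropped term are assumptions made for illustration, not results from the loan data.

```r
# Stand-in full model on a built-in data set (not the loan data)
full <- lm(mpg ~ wt + qsec + drat, data = mtcars)

# Refit without one term instead of retyping the whole formula
reduced <- update(full, . ~ . - drat)

# Compare adjusted R^2 before and after dropping the term
adj_full <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared
c(full = adj_full, reduced = adj_reduced)

# drop1() tests each term in turn, treating a multi-level factor as one unit
drop1(full, test = "F")
```

If the adjusted $R^2$ falls after a term is dropped, that is a reason to keep the term, which is the reasoning applied to `bankruptcy` above.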