R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9

Goal: Understand and experience logistic regression to predict the probability of loan default (due to fraud or other reasons). Build skills and confidence to search for online help.

Background: The data for this question contains information about borrowers, loans, and the outcome (defaulted or paid). We are concerned with minimizing loss, not with whether the default was intentional or unintentional. I developed this assignment to walk you through the process because I couldn’t find any assignment at this level that balances fundamentals and practical aspects. The data has been adapted from https://campus.datacamp.com/courses/credit-risk-modeling-in-r/ (but my approach is quite different).

Before starting: 1. You are not allowed to: 1a. Search for solutions to this assignment 1b. Subcontract your assignment to someone else 2. You are allowed to: 2a. Search for information about packages and functions you may use 2b. Consult with your teammates.

Individual assignment only: 179 total points (Rmd and html solution)

[1 point] Q1.

Start by entering your name and today’s date in Lines 3 and 4, respectively, to indicate your compliance with the Fuqua Honor Code. Then, run the chunk of code below by clicking on the green arrow (pointing to the right) at the top right of the chunk. Tip: I numbered the code chunks to match the question numbers. Chunk 1 specifies the knitting parameters.

[6 points] Q2.

Read and store the data from the file LoanData.rds into a variable called loanData. Then, inspect the data using 2 or more R commands. Tip: Use Google to learn about the rds file format and how to read it into R. You’ll likely need some libraries and packages. Rubric: 3 points for reading, 1 point for storing; 1 point each for using 2 R commands for inspecting.

#install.packages("tidyverse")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the data from the RDS file
loanData <- readRDS("LoanData.rds")

# Inspect the structure of the data
str(loanData)
## 'data.frame':    29092 obs. of  8 variables:
##  $ isLoanDefault  : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loanAmount     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ interestRate   : num  10.6 NA 13.5 NA NA ...
##  $ creditGrade    : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ employmentYears: int  10 25 13 3 9 11 0 3 3 0 ...
##  $ homeLiving     : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ incomeAnnual   : num  24000 12252 49200 36000 48000 ...
##  $ ageYears       : int  33 31 24 39 24 28 22 22 28 22 ...
# Display the first few rows of the data
head(loanData)
##   isLoanDefault loanAmount interestRate creditGrade employmentYears homeLiving
## 1             0       5000        10.65           B              10       RENT
## 2             0       2400           NA           C              25       RENT
## 3             0      10000        13.49           C              13       RENT
## 4             0       5000           NA           A               3       RENT
## 5             0       3000           NA           E               9       RENT
## 6             0      12000        12.69           B              11        OWN
##   incomeAnnual ageYears
## 1        24000       33
## 2        12252       31
## 3        49200       24
## 4        36000       39
## 5        48000       24
## 6        75000       28

[4 points] Q3.

I haven’t provided you the data dictionary, but the columns have descriptive names so you can easily interpret them. Perhaps homeLiving and creditGrade are unclear. Figure out what homeLiving and creditGrade represent by examining the summary. Rubric: 2 points for each (homeLiving and creditGrade).

### This section doesn't require code. Just answer below (outside) the code block.
#homeLiving refers to the borrower's housing status and has four levels: MORTGAGE, OTHER, OWN, and RENT.
#creditGrade refers to the borrower's credit rating. Credit quality generally decreases from A to G, with A the best grade; the rating is related to the interest rate, with higher grades receiving lower rates.
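Though no code is required here, the levels can be confirmed directly (a minimal sketch, assuming loanData from Q2):

# Count of rows at each level of the two factor columns
summary(loanData$homeLiving)
summary(loanData$creditGrade)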

[6 points] Q4.

Now, let’s make sure we understand credit grade using a crosstable of isLoanDefault and creditGrade. Tip: Follow these steps to make progress… 1. Find, install, and load the library with the CrossTable function. 2. Call CrossTable() on the creditGrade and isLoanDefault columns of loanData. Rubric: 4 points for finding, installing, and loading the correct libraries; 2 points for calling CrossTable with the correct parameters.

#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.3.2
CrossTable(loanData$creditGrade, loanData$isLoanDefault)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  29092 
## 
##  
##                      | loanData$isLoanDefault 
## loanData$creditGrade |         0 |         1 | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    A |      9084 |       565 |      9649 | 
##                      |    29.764 |   238.561 |           | 
##                      |     0.941 |     0.059 |     0.332 | 
##                      |     0.351 |     0.175 |           | 
##                      |     0.312 |     0.019 |           | 
## ---------------------|-----------|-----------|-----------|
##                    B |      8344 |       985 |      9329 | 
##                      |     0.299 |     2.398 |           | 
##                      |     0.894 |     0.106 |     0.321 | 
##                      |     0.323 |     0.305 |           | 
##                      |     0.287 |     0.034 |           | 
## ---------------------|-----------|-----------|-----------|
##                    C |      4904 |       844 |      5748 | 
##                      |     8.337 |    66.821 |           | 
##                      |     0.853 |     0.147 |     0.198 | 
##                      |     0.190 |     0.262 |           | 
##                      |     0.169 |     0.029 |           | 
## ---------------------|-----------|-----------|-----------|
##                    D |      2651 |       580 |      3231 | 
##                      |    17.096 |   137.024 |           | 
##                      |     0.820 |     0.180 |     0.111 | 
##                      |     0.102 |     0.180 |           | 
##                      |     0.091 |     0.020 |           | 
## ---------------------|-----------|-----------|-----------|
##                    E |       692 |       176 |       868 | 
##                      |     8.235 |    66.004 |           | 
##                      |     0.797 |     0.203 |     0.030 | 
##                      |     0.027 |     0.055 |           | 
##                      |     0.024 |     0.006 |           | 
## ---------------------|-----------|-----------|-----------|
##                    F |       155 |        56 |       211 | 
##                      |     5.663 |    45.394 |           | 
##                      |     0.735 |     0.265 |     0.007 | 
##                      |     0.006 |     0.017 |           | 
##                      |     0.005 |     0.002 |           | 
## ---------------------|-----------|-----------|-----------|
##                    G |        35 |        21 |        56 | 
##                      |     4.392 |    35.206 |           | 
##                      |     0.625 |     0.375 |     0.002 | 
##                      |     0.001 |     0.007 |           | 
##                      |     0.001 |     0.001 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |     25865 |      3227 |     29092 | 
##                      |     0.889 |     0.111 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 

[4 points] Q5.

Based on the above answers, deduce whether or not A is the best (highest) credit rating. Explain your reasoning. Tip: The third number in each box under the loanData$isLoanDefault column is the default rate. For example, the default rate for A was 0.059 (i.e., 5.9%). You do not need to understand CrossTable beyond that to answer this question. Rubric: 1 point for the correct answer, 3 points for the reasoning.

### This section doesn't require code. Just answer below (outside) the code block.
#We can see from the crosstable that the default rate increases from A to G, so A should be the best credit rating; in general, people with high credit are less likely to default than people with low credit.

Answer: A is the best (highest) credit rating because it has the lowest default rate (0.059), and the default rate increases with each subsequent letter grade.
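Though no code is required, the per-grade default rates behind this reasoning can be cross-checked in one line (a sketch, assuming loanData from Q2):

# Average of isLoanDefault within each credit grade, i.e., the default rate per grade
tapply(loanData$isLoanDefault, loanData$creditGrade, mean)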

[12 points] Q6.

Let’s check for any outliers in the numeric/integer columns related to the borrowers (employmentYears, incomeAnnual, ageYears) in steps: 1. Examine the summary statistics to pick out large relative gaps between the minimum and the 1st quartile as well as between the maximum and the 3rd quartile. 2. Then, display a scatter plot of each combination of two variables, one at a time. Tip: I prefer to visualize outliers using scatter plots of two variables at a time, so I can get a better understanding of the outliers, such as a huge salary for a young person (in age or employment duration). Tip: library(ggplot2) and library(gridExtra) may be useful. Rubric: 3 points for each plot; 3 points for examining the summary data (on honor code).

library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
summary(loanData[c("employmentYears", "incomeAnnual", "ageYears")])
##  employmentYears   incomeAnnual        ageYears    
##  Min.   : 0.000   Min.   :   4000   Min.   : 20.0  
##  1st Qu.: 2.000   1st Qu.:  40000   1st Qu.: 23.0  
##  Median : 4.000   Median :  56424   Median : 26.0  
##  Mean   : 6.145   Mean   :  67169   Mean   : 27.7  
##  3rd Qu.: 8.000   3rd Qu.:  80000   3rd Qu.: 30.0  
##  Max.   :62.000   Max.   :6000000   Max.   :144.0  
##  NA's   :809
plot1 <- ggplot(loanData, aes(x = ageYears, y = incomeAnnual, color = isLoanDefault)) +
  geom_point() +
  geom_hline(yintercept = quantile(loanData$incomeAnnual, 0.75) + 1.5 * IQR(loanData$incomeAnnual),
             linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "solid") +
  labs(title = "Scatter Plot: age vs income",
       subtitle = "Red dashed line indicates potential outliers",
       x = "Age (years)", y = "Income Annual")
plot2 <- ggplot(loanData, aes(x = employmentYears, y = incomeAnnual, color = isLoanDefault)) +
  geom_point() +
  geom_hline(yintercept = quantile(loanData$incomeAnnual, 0.75) + 1.5 * IQR(loanData$incomeAnnual),
             linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "solid") +
  labs(title = "Scatter Plot: employment vs income",
       subtitle = "Red dashed line indicates potential outliers",
       x = "Employment (Years)", y = "Income Annual")


plot3 <- ggplot(loanData, aes(x = ageYears, y = employmentYears, color = isLoanDefault)) +
  geom_point() +
  geom_hline(yintercept = quantile(loanData$employmentYears, 0.75, na.rm = TRUE) + 1.5 * IQR(loanData$employmentYears, na.rm = TRUE),
             linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "solid") +
  labs(title = "Scatter Plot: age vs employment",
       subtitle = "Red dashed line indicates potential outliers",
       x = "Age (years)", y = "Employment (Years)")


grid.arrange(plot1, plot2, plot3, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 809 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 809 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 809 rows containing non-finite values (`stat_smooth()`).
## Removed 809 rows containing missing values (`geom_point()`).

[6 points] Q7.

Now, identify and remove any outliers and store the result in loanDataNoOutliers. Then, examine loanDataNoOutliers to verify that your code worked. Tip: I only removed one row. The others may be useful for our model. Tip: The structure of loanData and loanDataNoOutliers should be identical except that the number of rows will differ (because you removed one or more rows). Rubric: 1 point for each column; 2 points for removing, 1 point for checking.

outliers <- loanData$ageYears > 120

outlier_rows <- loanData[outliers, ]
print(outlier_rows)
##       isLoanDefault loanAmount interestRate creditGrade employmentYears
## 19486             0       5000        12.73           C              12
##       homeLiving incomeAnnual ageYears
## 19486   MORTGAGE        6e+06      144
loanDataNoOutliers <- loanData[!outliers, ]

str(loanDataNoOutliers)
## 'data.frame':    29091 obs. of  8 variables:
##  $ isLoanDefault  : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loanAmount     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ interestRate   : num  10.6 NA 13.5 NA NA ...
##  $ creditGrade    : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ employmentYears: int  10 25 13 3 9 11 0 3 3 0 ...
##  $ homeLiving     : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ incomeAnnual   : num  24000 12252 49200 36000 48000 ...
##  $ ageYears       : int  33 31 24 39 24 28 22 22 28 22 ...

[3 points] Q8.

Should you report any removed row(s) for further investigation?

### This section doesn't require code. Just answer below (outside) the code block.
#Yes. Reporting deleted rows is critical for transparency and documentation in data analysis. It helps assess data quality, supports reproducibility, and allows for sensitivity analysis. By documenting the criteria and characteristics of deleted rows, others can understand the decision-making process, consider ethical implications, and gain insight into potential data problems. This practice increases the credibility of the analysis and promotes trust in the preprocessing steps. In this case, since the human lifespan tops out around 120 years, any value above 120 is an outlier, so I deleted the row where ageYears equals 144.

[4 points] Q9.

Why do we store the loanData in a new variable loanDataNoOutliers? Can you imagine a scenario when you would be better off not storing the result? Rubric: 2 points for each answer.

### This section doesn't require code. Just answer below (outside) the code block.
#Creating a new variable (e.g., loanDataNoOutliers) preserves the original dataset (loanData) for data integrity and reproducibility. Directly modifying the original dataset can hinder analysis reproduction. The new variable serves as documentation, clarifying specific operations (e.g., removing outliers). This aids in understanding and reproducing the workflow. Retaining both datasets facilitates easy comparison, enabling analysis of data characteristics before and after preprocessing.
#In cases where memory is a significant limitation, creating a new variable consumes additional resources; in that situation I would modify the original dataset directly.

[9 points] Q10.

Now, let’s handle missing data by: 1. Making a copy of loanDataNoOutliers called loanDataNoOutliersNA. Then, conduct the following operations on loanDataNoOutliersNA. 2. Compute the percentage of missing (NA) values in each column. 3. Add a new column for each variable that has missing values, as missing data can be a fraud indicator. The value of this column is 1 to indicate rows with missing values, and 0 otherwise. 4. Replace the missing values with the median value of the column if fewer than 10% of values are missing. 5. Delete the entire column if 10% or more of its values are missing. Tip: mean(is.na(dfName$colName)) * 100 gives you the percentage of missing values in colName of dfName. Tip: median(dfName$colName, na.rm = TRUE) gives you the median value (ignoring all NAs). Tip: The summary should be similar except for the added columns and imputed data. Tip: In real life, I would have asked you to use better imputation methods than just replacing by the median. Rubric: 0.5 points for checking each column; 4 points for handling NAs, 1 point for checking.

loanDataNoOutliersNA <- loanDataNoOutliers

missing_percentage <- colMeans(is.na(loanDataNoOutliersNA)) * 100


for (col in names(missing_percentage[missing_percentage > 0])) {
  loanDataNoOutliersNA[paste0(col, "_missing")] <- as.integer(is.na(loanDataNoOutliersNA[, col]))
}


for (col in names(missing_percentage[missing_percentage > 0 & missing_percentage < 10])) {
  median_value <- median(loanDataNoOutliersNA[, col], na.rm = TRUE)
  loanDataNoOutliersNA[, col][is.na(loanDataNoOutliersNA[, col])] <- median_value
}


# Delete any column with 10% or more missing values (none in this data)
colsToDrop <- names(missing_percentage[missing_percentage >= 10])
loanDataNoOutliersNA <- loanDataNoOutliersNA[, !(names(loanDataNoOutliersNA) %in% colsToDrop)]


summary_output <- c()
for (col in colnames(loanDataNoOutliersNA)) {
  if (col %in% names(missing_percentage[missing_percentage > 0])) {
    summary_output <- c(summary_output, paste0(missing_percentage[col], "%", col, " values are missing."))
  } else {
    summary_output <- c(summary_output, paste0("0 %", col, " values are missing."))
  }
}
summary_output
##  [1] "0 %isLoanDefault values are missing."                
##  [2] "0 %loanAmount values are missing."                   
##  [3] "9.54247017978069%interestRate values are missing."   
##  [4] "0 %creditGrade values are missing."                  
##  [5] "2.78092880959747%employmentYears values are missing."
##  [6] "0 %homeLiving values are missing."                   
##  [7] "0 %incomeAnnual values are missing."                 
##  [8] "0 %ageYears values are missing."                     
##  [9] "0 %interestRate_missing values are missing."         
## [10] "0 %employmentYears_missing values are missing."
summary(loanDataNoOutliersNA)
##  isLoanDefault      loanAmount     interestRate   creditGrade employmentYears 
##  Min.   :0.0000   Min.   :  500   Min.   : 5.42   A:9649      Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.: 5000   1st Qu.: 8.49   B:9329      1st Qu.: 2.000  
##  Median :0.0000   Median : 8000   Median :10.99   C:5747      Median : 4.000  
##  Mean   :0.1109   Mean   : 9594   Mean   :11.00   D:3231      Mean   : 6.085  
##  3rd Qu.:0.0000   3rd Qu.:12250   3rd Qu.:13.11   E: 868      3rd Qu.: 8.000  
##  Max.   :1.0000   Max.   :35000   Max.   :23.22   F: 211      Max.   :62.000  
##                                                   G:  56                      
##     homeLiving     incomeAnnual        ageYears    interestRate_missing
##  MORTGAGE:12001   Min.   :   4000   Min.   :20.0   Min.   :0.00000     
##  OTHER   :   97   1st Qu.:  40000   1st Qu.:23.0   1st Qu.:0.00000     
##  OWN     : 2301   Median :  56400   Median :26.0   Median :0.00000     
##  RENT    :14692   Mean   :  66965   Mean   :27.7   Mean   :0.09542     
##                   3rd Qu.:  80000   3rd Qu.:30.0   3rd Qu.:0.00000     
##                   Max.   :2039784   Max.   :94.0   Max.   :1.00000     
##                                                                        
##  employmentYears_missing
##  Min.   :0.00000        
##  1st Qu.:0.00000        
##  Median :0.00000        
##  Mean   :0.02781        
##  3rd Qu.:0.00000        
##  Max.   :1.00000        
## 

[9 points] Q11.

We should consider some of our features (aka variables and columns) for categorization because they don’t have a clear relationship with the target variable. For example, a higher interest rate may cause more defaults due to financial burden, while a lower interest rate may cause more defaults due to complacency (or procrastination). Which other variables should you consider for categorization? Rubric: 1 point for each feature (isLoanDefault is not a feature)

### This section doesn't require code. Just answer below (outside) the code block.
#Because isLoanDefault is binary (0/1), converting some numeric features into categories may help the model (a sketch of one possible binning appears after this list):
#ageYears into age groups like "Young" < 25, 25 <= "Middle-Aged" < 50, "Elder" >= 50
#incomeAnnual into income brackets like "Low" < 40000, 40000 <= "Medium" < 80000, "High" >= 80000
#employmentYears into entry <= 5, 5 < mid-level <= 20, senior > 20
#interestRate into low <= 8.5, 8.5 < medium <= 13, high > 13
#creditGrade into high = A, mid = B, low = (C, D, E, F, G)
#loanAmount into low <= 5000, 5000 < mid <= 8500, high > 8500
#homeLiving is already nominal, so I don't consider it for categorization
#interestRate_missing and employmentYears_missing are binary variables, so I don't consider them for categorization.
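Though this question requires no code, here is a minimal sketch of how one of the bins above could be created with cut(); the break points are the illustrative values from my answer, not values specified by the assignment:

# Illustrative only (not part of the graded pipeline): bin ageYears into the groups above
ageGroup <- cut(loanDataNoOutliersNA$ageYears,
                breaks = c(-Inf, 25, 50, Inf),
                labels = c("Young", "Middle-Aged", "Elder"),
                right = FALSE)
table(ageGroup)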

[10 points] Q12.

Before running the regression, separate loanDataNoOutliersNA into training data (dataframe loanTrain) and test data (dataframe loanTest) using the sample() function. Start with set.seed(2020) so we can reproduce the randomization across multiple runs (see http://rfunction.com/archives/62 to learn more). The dataframe loanTest should contain 1/3 of loanDataNoOutliersNA’s rows. The dataframe loanTrain should contain the other 2/3 of loanDataNoOutliersNA’s rows. Then, inspect the structure of loanTrain and loanTest to verify that your code works. Tip: Follow these steps: 1. set.seed(2020) 2. Use sample() to generate loanTestIndices as a sample of 1/3 of loanDataNoOutliersNA’s indices for testing. The remaining 2/3 will be used to train. (Note: Most people generate the 2/3 indices for training, but the optimizer in me prefers to assign 1/3 as much work to my computer - this is a byproduct of working in real-time systems where every nanosecond counts.) 3. Use two statements like the ones below to split between test and training data: loanTest = loanDataNoOutliersNA[loanTestIndices, ] and loanTrain = loanDataNoOutliersNA[-loanTestIndices, ] Rubric: 1 point for set.seed(), 3 points for correctly sampling, 4 points (2 each) for constructing loanTest and loanTrain, 2 points for verifying.

set.seed(2020)


loanTestIndices <- sample(nrow(loanDataNoOutliersNA), size = round(1/3 * nrow(loanDataNoOutliersNA)))


loanTest <- loanDataNoOutliersNA[loanTestIndices, ]
loanTrain <- loanDataNoOutliersNA[-loanTestIndices, ]


str(loanTrain)
## 'data.frame':    19394 obs. of  10 variables:
##  $ isLoanDefault          : int  0 0 0 1 0 1 0 0 0 0 ...
##  $ loanAmount             : int  5000 2400 10000 9000 3000 10000 10000 6000 10000 10000 ...
##  $ interestRate           : num  10.65 10.99 13.49 13.49 9.91 ...
##  $ creditGrade            : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 3 2 2 3 2 2 2 ...
##  $ employmentYears        : num  10 25 13 0 3 3 4 1 13 5 ...
##  $ homeLiving             : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 4 4 1 3 4 ...
##  $ incomeAnnual           : num  24000 12252 49200 30000 15000 ...
##  $ ageYears               : int  33 31 24 22 22 28 23 30 23 22 ...
##  $ interestRate_missing   : int  0 1 0 0 0 0 1 1 0 0 ...
##  $ employmentYears_missing: int  0 0 0 0 0 0 0 0 0 0 ...
str(loanTest)
## 'data.frame':    9697 obs. of  10 variables:
##  $ isLoanDefault          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ loanAmount             : int  6700 14900 14000 10000 12000 9450 6000 7400 4600 8500 ...
##  $ interestRate           : num  11.99 9.63 10.99 14.79 11.71 ...
##  $ creditGrade            : Factor w/ 7 levels "A","B","C","D",..: 2 1 2 3 2 2 3 3 2 1 ...
##  $ employmentYears        : num  3 1 13 1 9 9 2 0 1 21 ...
##  $ homeLiving             : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 1 4 1 4 3 4 1 ...
##  $ incomeAnnual           : num  48000 63000 310000 100000 38340 ...
##  $ ageYears               : int  22 37 30 29 25 35 38 26 23 24 ...
##  $ interestRate_missing   : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ employmentYears_missing: int  0 0 0 0 0 0 0 0 0 0 ...

[6 points] Q13.

Run a logistic regression on the loanTrain data and store the result in glmBase. Then, print the summary of glmBase. Tip: Your call will look something like the one below. glmBase = glm(target ~ ., family = "binomial", data = trainingdata) Note: You will need to replace your call with the appropriate variables. (The . indicates that we will consider all features in the regression equation.) Tip: I realize that some people would consider categorization at this stage. I like to learn more by running a regression now before considering categorizations. (Notice that we don’t need to consider categorizing the missing-value indicator columns because binary data is already a factor and ordering doesn’t matter in binary data.) Rubric: 5 points for the glm call, 1 point for summary

glmBase <- glm(isLoanDefault ~ ., family = "binomial", data = loanTrain)

summary(glmBase)
## 
## Call:
## glm(formula = isLoanDefault ~ ., family = "binomial", data = loanTrain)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.789e+00  1.914e-01 -14.577  < 2e-16 ***
## loanAmount              -1.679e-06  4.191e-06  -0.401 0.688645    
## interestRate             5.482e-02  1.800e-02   3.045 0.002325 ** 
## creditGradeB             4.663e-01  9.011e-02   5.174 2.29e-07 ***
## creditGradeC             7.664e-01  1.221e-01   6.276 3.48e-10 ***
## creditGradeD             8.847e-01  1.530e-01   5.783 7.33e-09 ***
## creditGradeE             9.846e-01  1.972e-01   4.994 5.92e-07 ***
## creditGradeF             1.396e+00  2.752e-01   5.073 3.93e-07 ***
## creditGradeG             1.439e+00  4.287e-01   3.357 0.000788 ***
## employmentYears          1.062e-02  3.573e-03   2.973 0.002949 ** 
## homeLivingOTHER          5.072e-01  3.578e-01   1.417 0.156338    
## homeLivingOWN           -1.621e-01  9.352e-02  -1.733 0.083061 .  
## homeLivingRENT          -8.467e-03  5.321e-02  -0.159 0.873562    
## incomeAnnual            -5.712e-06  7.965e-07  -7.170 7.48e-13 ***
## ageYears                -5.185e-03  3.896e-03  -1.331 0.183176    
## interestRate_missing    -1.634e-02  8.200e-02  -0.199 0.842044    
## employmentYears_missing  8.116e-01  1.124e-01   7.224 5.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13468  on 19393  degrees of freedom
## Residual deviance: 12886  on 19377  degrees of freedom
## AIC: 12920
## 
## Number of Fisher Scoring iterations: 5

[9 points] Q14.

Examine the significance levels of your base model above by reviewing the summary(). Which variables are most significant? Which are mildly significant, and which are not very significant? Tip: Focus on the last column (low Pr(>|z|) values) and the asterisks (more asterisks indicate higher significance) to determine which variables are most significant. The least significant variables have high Pr(>|z|) values and few or no asterisks. Tip: The first two columns are not useful here because they are not scaled. Rubric: 1 point for each feature

### This section doesn't require code. Just answer below (outside) the code block.
#Most Significant Variables (Highly significant):

#creditGrade (B, C, D, E, F, G): All categories have low p-values, indicating high significance.
#employmentYears: The variable has a low p-value, indicating high significance.
#incomeAnnual: The variable has a low p-value, indicating high significance.
#employmentYears_missing: The variable has a low p-value, indicating high significance.
#Mildly Significant Variables:

#interestRate: The variable has a moderate p-value, indicating some significance.
#Not Very Significant Variables:

#loanAmount: The variable has a high p-value, indicating low significance.
#homeLiving (OTHER, OWN, RENT): These categories have higher p-values, indicating lower significance.
#ageYears: The variable has a higher p-value, indicating lower significance.
#interestRate_missing: The variable has a higher p-value, indicating lower significance.
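Though no code is required here, the ranking can be checked programmatically (a sketch, assuming glmBase from Q13):

# Coefficient p-values sorted from most to least significant
sort(coef(summary(glmBase))[, "Pr(>|z|)"])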

[10 points] Q15.

In real life, we would now try to optimize our model by: filtering through hundreds of features (metaphorically), considering categorizing or transforming variables with low significance, etc. However, for this exercise, we will start our predictions based on this model because we have only a few features. We may refine our model with transformations and categorizations later. Specifically, carry out the following steps: 1. Compute the predictions for each row in the loanTest dataframe using the predict function and store the result in a vector called predictionsBase. Modify the predict function below for your needs and/or Google for help. predict(object = yourModel, newdata = test, type = "response") where object is your model, newdata is your test data, and type is always "response" (because that converts from log-odds to probability). 2. Then, append the predictionsBase vector to loanTest (as its last column). 3. Print a statistical summary of predictionsBase, loanTest[, "isLoanDefault"], and predictionsBase - loanTest[, "isLoanDefault"] to check the accuracy of the predictions. Relatively similar summary statistics and a difference near zero are good early signs. Rubric: 6 points for correct use of predict, 1 point for appending the new column, 3 points (1 point each) for summary()

# Assuming your model is stored in glmBase
predictionsBase <- predict(object = glmBase, newdata = loanTest, type = "response")
loanTest$predictionsBase <- predictionsBase
# Summary of predictionsBase
cat("Summary of predictionsBase:\n")
## Summary of predictionsBase:
summary(predictionsBase)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000016 0.0622468 0.1035845 0.1099046 0.1455037 0.4958792
# Summary of loanTest[, "isLoanDefault"]
cat("\nSummary of loanTest[, 'isLoanDefault']:\n")
## 
## Summary of loanTest[, 'isLoanDefault']:
summary(loanTest[, "isLoanDefault"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1121  0.0000  1.0000
# Summary of predictionsBase - loanTest[, "isLoanDefault"]
cat("\nSummary of predictionsBase - loanTest[, 'isLoanDefault']:\n")
## 
## Summary of predictionsBase - loanTest[, 'isLoanDefault']:
summary(predictionsBase - loanTest[, "isLoanDefault"])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.997289  0.054084  0.091415 -0.002192  0.132609  0.465552
# Summary statistics
summary_stats <- cbind(
  predictionsBase = c(
    Min = min(predictionsBase),
    `1st Qu.` = quantile(predictionsBase, 0.25),
    Median = median(predictionsBase),
    Mean = mean(predictionsBase),
    `3rd Qu.` = quantile(predictionsBase, 0.75),
    Max = max(predictionsBase)
  ),
  isLoanDefault = c(
    Min = min(loanTest[, "isLoanDefault"]),
    `1st Qu.` = quantile(loanTest[, "isLoanDefault"], 0.25),
    Median = median(loanTest[, "isLoanDefault"]),
    Mean = mean(loanTest[, "isLoanDefault"]),
    `3rd Qu.` = quantile(loanTest[, "isLoanDefault"], 0.75),
    Max = max(loanTest[, "isLoanDefault"])
  ),
  difference = c(
    Min = min(predictionsBase - loanTest[, "isLoanDefault"]),
    `1st Qu.` = quantile(predictionsBase - loanTest[, "isLoanDefault"], 0.25),
    Median = median(predictionsBase - loanTest[, "isLoanDefault"]),
    Mean = mean(predictionsBase - loanTest[, "isLoanDefault"]),
    `3rd Qu.` = quantile(predictionsBase - loanTest[, "isLoanDefault"], 0.75),
    Max = max(predictionsBase - loanTest[, "isLoanDefault"])
  )
)

# Print the summary statistics
print(summary_stats)
##             predictionsBase isLoanDefault   difference
## Min            1.648587e-06     0.0000000 -0.997288980
## 1st Qu..25%    6.224677e-02     0.0000000  0.054084136
## Median         1.035845e-01     0.0000000  0.091414705
## Mean           1.099046e-01     0.1120965 -0.002191875
## 3rd Qu..75%    1.455037e-01     0.0000000  0.132608805
## Max            4.958792e-01     1.0000000  0.465551464

[4 points] Q16.

What are your thoughts on the summary of predictionsBase - loanTest[, "isLoanDefault"]? Tip: Check the min, median, mean, and max of the value. Rubric: 1 point for observing each item in the tip.

### This section doesn't require code. Just answer below (outside) the code block.
#The minimum value of -0.997289 indicates instances where the model predicts extremely low probabilities of loan default when the actual outcome is 1. This suggests a potential area for improvement, as the model occasionally underestimates the likelihood of default. The median value of 0.091415 signifies that half of the differences between predictions and actual values are below this threshold, indicating a generally positive bias in the model's estimates. The mean, approximately -0.002192, suggests a small average difference, and the negative sign implies a slight tendency of the model to underestimate the probability of loan default on average. The maximum value of 0.465552 highlights cases where the model predicts probabilities close to 0.5 when the actual outcome is 0, indicating potential challenges in distinguishing non-default cases. This comprehensive analysis provides valuable insights into the model's strengths and areas for refinement, guiding further optimization efforts.

[4 points] Q17.

Notice that isLoanDefault is 0 or 1, while predictionsBase is a probability value (that ranges from 0 to about 0.5). That means we need to convert the probability to a 0 or 1 prediction using a cutoff value that maps each probability to either 0 or 1. A natural cutoff value is 0.5. Count the number of rows that would be classified as 0 and 1, respectively, using the cutoff of 0.5. Tip: Add up the number of TRUE values using sum on the respective logical conditions. Rubric: 2 points each.

# Count the number of rows classified as 0 and 1
num_class_0 <- sum(predictionsBase <= 0.5)
num_class_1 <- sum(predictionsBase > 0.5)

num_class_0
## [1] 9697
num_class_1
## [1] 0

[6 points] Q18.

Consider the cutoff of 0.5 based on the output from the last chunk. Would you adjust the cutoff value? Why? How? Tip: I learned this on my own and this was an important lesson for choosing the cutoff values. Rubric: 2 points for each question above

### This section doesn't require code. Just answer below (outside) the code block.
# Yes, because the output indicates that every predicted probability falls below the 0.5 cutoff, so all loans are classified as 0 (no default) and none as 1 (default). I would adjust the cutoff by trying different thresholds to find a balance that better reflects the actual default rate. In this case, lowering the threshold (e.g., to about 0.1) would produce a more balanced split between predictions of 0 and 1.
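One quick way to see the effect of a lower cutoff (a sketch using predictionsBase from Q15; 0.1 is only an illustrative threshold):

# How many test loans each cutoff would flag as default (1)
sum(predictionsBase > 0.5)   # the natural cutoff flags none, since max(predictionsBase) < 0.5
sum(predictionsBase > 0.1)   # an illustrative lower cutoff flags many more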

[6 points] Q19.

Now, set a value for a variable called cutoff that improves the cutoff value based on what you know about your model and the data. Then, use this cutoff value to compute a new column vector called isPrediction based on predictionsBase. Then, add this vector as the first column of loanTest (so it can be side by side with isLoanDefault). Finally, print a summary of loanTest to verify that your code works. Tip: I based my cutoff on the fact that 0.1121 of the test cases are bad loans. Specifically, I sorted the predictionsBase vector in non-increasing order (using the sort() function with decreasing = TRUE), and then I set the value of cutoff equal to the value of element number as.integer(0.1121 * length(sortedPredictionBase)) of the sorted list. Rubric: 2 points each for the choice of cutoff and generating isPrediction; 1 point each for adding isPrediction as the first column of loanTest and verifying the result.

# Calculate the cutoff based on the proportion of bad loans
sortedPredictionBase <- sort(predictionsBase, decreasing = TRUE)
cutoff_index <- as.integer(0.1121 * length(sortedPredictionBase))
cutoff <- sortedPredictionBase[cutoff_index]

print(cutoff)
##     24511 
## 0.1784657
# Create isPrediction based on the cutoff
isPrediction <- ifelse(predictionsBase > cutoff, 1, 0)

# Add isPrediction as the first column of loanTest
loanTest <- cbind(isPrediction, loanTest)


# Print a summary of loanTest
summary(loanTest)
##   isPrediction   isLoanDefault      loanAmount     interestRate   creditGrade
##  Min.   :0.000   Min.   :0.0000   Min.   :  500   Min.   : 5.42   A:3226     
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.: 5000   1st Qu.: 8.49   B:3061     
##  Median :0.000   Median :0.0000   Median : 8000   Median :10.99   C:1935     
##  Mean   :0.112   Mean   :0.1121   Mean   : 9618   Mean   :11.02   D:1103     
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:12375   3rd Qu.:13.16   E: 275     
##  Max.   :1.000   Max.   :1.0000   Max.   :35000   Max.   :23.22   F:  78     
##                                                                   G:  19     
##  employmentYears     homeLiving    incomeAnnual        ageYears    
##  Min.   : 0.000   MORTGAGE:4003   Min.   :   4000   Min.   :20.00  
##  1st Qu.: 2.000   OTHER   :  40   1st Qu.:  40000   1st Qu.:23.00  
##  Median : 4.000   OWN     : 708   Median :  57000   Median :26.00  
##  Mean   : 6.045   RENT    :4946   Mean   :  67621   Mean   :27.76  
##  3rd Qu.: 8.000                   3rd Qu.:  80300   3rd Qu.:30.00  
##  Max.   :56.000                   Max.   :1900000   Max.   :76.00  
##                                                                    
##  interestRate_missing employmentYears_missing predictionsBase    
##  Min.   :0.00000      Min.   :0.00000         Min.   :0.0000016  
##  1st Qu.:0.00000      1st Qu.:0.00000         1st Qu.:0.0622468  
##  Median :0.00000      Median :0.00000         Median :0.1035845  
##  Mean   :0.09477      Mean   :0.02465         Mean   :0.1099046  
##  3rd Qu.:0.00000      3rd Qu.:0.00000         3rd Qu.:0.1455037  
##  Max.   :1.00000      Max.   :1.00000         Max.   :0.4958792  
## 

[8 points] Q20.

Now, compute the number of True Negatives, False Negatives, False Positives, and True Positives, respectively. Store the results in tneg, fneg, fpos, and tpos, respectively. Then, print all four values. Tip: tneg = sum((0 == loanTest$isPrediction) & (0 == loanTest$isLoanDefault)) Rubric: 2 points for each value.

# Compute True Negatives, False Negatives, False Positives, True Positives
tneg <- sum((loanTest$isPrediction == 0) & (loanTest$isLoanDefault == 0))
fneg <- sum((loanTest$isPrediction == 0) & (loanTest$isLoanDefault == 1))
fpos <- sum((loanTest$isPrediction == 1) & (loanTest$isLoanDefault == 0))
tpos <- sum((loanTest$isPrediction == 1) & (loanTest$isLoanDefault == 1))

# Print the values
cat("True Negatives:", tneg, "\n")
## True Negatives: 7759
cat("False Negatives:", fneg, "\n")
## False Negatives: 852
cat("False Positives:", fpos, "\n")
## False Positives: 851
cat("True Positives:", tpos, "\n")
## True Positives: 235
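As a cross-check (not required by the rubric), the headline metrics reported by confusionMatrix() in the next question follow directly from these four counts; a minimal sketch, keeping in mind that caret treats 0 (non-default) as the 'positive' class here:

# Derive accuracy, sensitivity, and specificity by hand from the counts above
accuracyManual    <- (tneg + tpos) / (tneg + fneg + fpos + tpos)
sensitivityManual <- tneg / (tneg + fpos)  # correctly identified non-defaults ('positive' class 0)
specificityManual <- tpos / (tpos + fneg)  # correctly identified defaults
c(accuracy = accuracyManual, sensitivity = sensitivityManual, specificity = specificityManual)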

[8 points] Q21.

Now, let’s use the confusionMatrix() function to gauge the model’s confusion. The confusionMatrix() function is similar to our true/false negatives/positives calculations above, but it also generates more data. Tip: You will need to install/library caret and e1071 to call confusionMatrix() as shown below: confusionMatrix(data = as.factor(predicted), reference = as.factor(actual)) Notice that we have to convert integers to factors. The predicted parameter is loanTest$isPrediction and the actual parameter is loanTest$isLoanDefault. Rubric: 4 points for installing libraries. 4 points for using confusionMatrix().

# Install and load the required libraries if not already installed
#if (!requireNamespace("caret", quietly = TRUE)) {
#install.packages("caret")
#}

#if (!requireNamespace("e1071", quietly = TRUE)) {
  #install.packages("e1071")
#}
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)
conf_matrix <- confusionMatrix(data = as.factor(loanTest$isPrediction), 
                               reference = as.factor(loanTest$isLoanDefault))

# Print confusion matrix
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7759  852
##          1  851  235
##                                           
##                Accuracy : 0.8244          
##                  95% CI : (0.8167, 0.8319)
##     No Information Rate : 0.8879          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1174          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9012          
##             Specificity : 0.2162          
##          Pos Pred Value : 0.9011          
##          Neg Pred Value : 0.2164          
##              Prevalence : 0.8879          
##          Detection Rate : 0.8001          
##    Detection Prevalence : 0.8880          
##       Balanced Accuracy : 0.5587          
##                                           
##        'Positive' Class : 0               
## 

[6 points] Q22.

Now, let’s return to our choice of cutoff to understand how this choice impacts the Accuracy, Sensitivity, and Specificity. Run the code below to print the confusion matrix for the cutoff you used earlier, and verify that this gives you the same results as the previous chunk (while bypassing a lot of other code). Then, learn about the confusion matrix by Googling “accuracy sensitivity specificity confusion matrix” and/or reading https://towardsdatascience.com/taking-the-confusion-out-of-confusion-matrices-c1ce054b3d3e. Rubric: 1 point for running the code; 3 points for learning more about the confusion matrix (based on honor code)

### Do not edit this code, just run it!
confusionMatrix(data = as.factor(as.numeric(cutoff < loanTest$predictionsBase)), 
                reference = as.factor(loanTest$isLoanDefault))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7759  852
##          1  851  235
##                                           
##                Accuracy : 0.8244          
##                  95% CI : (0.8167, 0.8319)
##     No Information Rate : 0.8879          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1174          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9012          
##             Specificity : 0.2162          
##          Pos Pred Value : 0.9011          
##          Neg Pred Value : 0.2164          
##              Prevalence : 0.8879          
##          Detection Rate : 0.8001          
##    Detection Prevalence : 0.8880          
##       Balanced Accuracy : 0.5587          
##                                           
##        'Positive' Class : 0               
## 

[6 points] Q23.

Now, increase the cutoff by 10 percent and report whether the following parameters increased or decreased compared to your original cutoff: Accuracy, Sensitivity, and Specificity. Tip: Ignore any warnings. Rubric: 2 points each for Accuracy, Sensitivity, and Specificity

# Increase the cutoff by 10%
new_cutoff <- cutoff* 1.1

# Recompute isPrediction based on the new cutoff
loanTest$isPrediction_new <- ifelse(predictionsBase >= new_cutoff, 1, 0)

# Generate confusion matrix for the new predictions
conf_matrix_new <- confusionMatrix(data = as.factor(loanTest$isPrediction_new), 
                                    reference = as.factor(loanTest$isLoanDefault))

confusionMatrix(data = as.factor(loanTest$isPrediction_new), 
                                    reference = as.factor(loanTest$isLoanDefault))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8124  942
##          1  486  145
##                                           
##                Accuracy : 0.8527          
##                  95% CI : (0.8455, 0.8597)
##     No Information Rate : 0.8879          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0942          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9436          
##             Specificity : 0.1334          
##          Pos Pred Value : 0.8961          
##          Neg Pred Value : 0.2298          
##              Prevalence : 0.8879          
##          Detection Rate : 0.8378          
##    Detection Prevalence : 0.9349          
##       Balanced Accuracy : 0.5385          
##                                           
##        'Positive' Class : 0               
## 
# Extract relevant metrics
accuracy_new <- conf_matrix_new$overall["Accuracy"]
sensitivity_new <- conf_matrix_new$byClass["Sensitivity"]
specificity_new <- conf_matrix_new$byClass["Specificity"]

# Print the results
cat("Accuracy:", accuracy_new, "\n")
## Accuracy: 0.852738
cat("Sensitivity:", sensitivity_new, "\n")
## Sensitivity: 0.943554
cat("Specificity:", specificity_new, "\n")
## Specificity: 0.1333947
### Complete the following phrases:
## The Accuracy increased (from 0.8244 to 0.8527).
## The Sensitivity increased (from 0.9012 to 0.9436).
## The Specificity decreased (from 0.2162 to 0.1334).

[6 points] Q24.

Now, decrease the cutoff by 10 percent and report whether the following parameters increased or decreased: Accuracy, Sensitivity, and Specificity. Rubric: 2 points each for Accuracy, Sensitivity, and Specificity

# Decrease the cutoff by 10%
new_cutoff <- cutoff* 0.9

# Recompute isPrediction based on the new cutoff
loanTest$isPrediction_new <- ifelse(predictionsBase >= new_cutoff, 1, 0)

# Generate confusion matrix for the new predictions
conf_matrix_new <- confusionMatrix(data = as.factor(loanTest$isPrediction_new), 
                                    reference = as.factor(loanTest$isLoanDefault))

confusionMatrix(data = as.factor(loanTest$isPrediction_new), 
                                    reference = as.factor(loanTest$isLoanDefault))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7164  735
##          1 1446  352
##                                           
##                Accuracy : 0.7751          
##                  95% CI : (0.7666, 0.7834)
##     No Information Rate : 0.8879          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1212          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8321          
##             Specificity : 0.3238          
##          Pos Pred Value : 0.9070          
##          Neg Pred Value : 0.1958          
##              Prevalence : 0.8879          
##          Detection Rate : 0.7388          
##    Detection Prevalence : 0.8146          
##       Balanced Accuracy : 0.5779          
##                                           
##        'Positive' Class : 0               
## 
# Extract relevant metrics
accuracy_new <- conf_matrix_new$overall["Accuracy"]
sensitivity_new <- conf_matrix_new$byClass["Sensitivity"]
specificity_new <- conf_matrix_new$byClass["Specificity"]

# Print the results
cat("Accuracy:", accuracy_new, "\n")
## Accuracy: 0.7750851
cat("Sensitivity:", sensitivity_new, "\n")
## Sensitivity: 0.8320557
cat("Specificity:", specificity_new, "\n")
## Specificity: 0.323827
### Complete the following phrases:
## The Accuracy decreased (from 0.8244 to 0.7751).
## The Sensitivity decreased (from 0.9012 to 0.8321).
## The Specificity increased (from 0.2162 to 0.3238).

[5 points] Q25.

Find the maximum value you can use to replace 0.1 below without generating a warning. Tip: The summary of predictionsBase will give you a clue.

### Do not edit any other than 0.1 in this code, just run it!
summary(loanTest$predictionsBase)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000016 0.0622468 0.1035845 0.1099046 0.1455037 0.4958792
### Changed 0.1 to 0.4958791, just below the maximum of predictionsBase
confusionMatrix(data = as.factor(as.numeric(0.4958791 < loanTest$predictionsBase)), 
                reference = as.factor(loanTest$isLoanDefault))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8610 1086
##          1    0    1
##                                           
##                Accuracy : 0.888           
##                  95% CI : (0.8816, 0.8942)
##     No Information Rate : 0.8879          
##     P-Value [Acc > NIR] : 0.4952          
##                                           
##                   Kappa : 0.0016          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.00000         
##             Specificity : 0.00092         
##          Pos Pred Value : 0.88800         
##          Neg Pred Value : 1.00000         
##              Prevalence : 0.88790         
##          Detection Rate : 0.88790         
##    Detection Prevalence : 0.99990         
##       Balanced Accuracy : 0.50046         
##                                           
##        'Positive' Class : 0               
## 
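As a hint toward the answer (a sketch only), the largest usable replacement sits just below the maximum predicted probability, which can be read off directly:

# The maximum predicted probability bounds the usable cutoff from above
max(loanTest$predictionsBase)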

[3 points] Q26.

What is the danger in maximizing the accuracy? Tip: Review the confusion matrices above.

### This section doesn't require code. Just answer below (outside) the code block.
# If the model maximizes accuracy by predicting almost everything as the majority class, sensitivity (for the 'positive' class 0) becomes very high but specificity becomes extremely low, meaning the model can no longer effectively identify the defaults we actually care about.
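To make this concrete (a sketch, assuming loanTest from above): a trivial "model" that predicts 0 for every loan already matches the No Information Rate, which is why accuracy alone is misleading on imbalanced data:

# Accuracy of always predicting "no default" equals the share of non-defaults in the test set
mean(loanTest$isLoanDefault == 0)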

[5 points] Q27.

Examine the output from Chunk 13 and run a new regression with (only) the top 5 predictors of the target variable, and store the result in glmTop5. Then, print the summary of glmTop5. Rubric: 1 point for each of the top 5 predictors.

top_predictors <- c("creditGrade", "incomeAnnual", "interestRate", "employmentYears", "employmentYears_missing")

# Creating the formula for the new model
formula_top5 <- as.formula(paste("isLoanDefault ~", paste(top_predictors, collapse = " + ")))

# Running the new regression with only the top 5 predictors
glmTop5 <- glm(formula_top5, family = "binomial", data = loanTrain)

# Printing the summary of glmTop5
summary(glmTop5)
## 
## Call:
## glm(formula = formula_top5, family = "binomial", data = loanTrain)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.958e+00  1.528e-01 -19.362  < 2e-16 ***
## creditGradeB             4.649e-01  8.906e-02   5.220 1.79e-07 ***
## creditGradeC             7.647e-01  1.204e-01   6.353 2.12e-10 ***
## creditGradeD             8.796e-01  1.505e-01   5.843 5.13e-09 ***
## creditGradeE             9.776e-01  1.943e-01   5.032 4.85e-07 ***
## creditGradeF             1.393e+00  2.719e-01   5.123 3.00e-07 ***
## creditGradeG             1.413e+00  4.258e-01   3.318 0.000907 ***
## incomeAnnual            -5.804e-06  6.986e-07  -8.308  < 2e-16 ***
## interestRate             5.508e-02  1.761e-02   3.128 0.001763 ** 
## employmentYears          1.040e-02  3.520e-03   2.954 0.003142 ** 
## employmentYears_missing  7.947e-01  1.116e-01   7.118 1.09e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13468  on 19393  degrees of freedom
## Residual deviance: 12894  on 19383  degrees of freedom
## AIC: 12916
## 
## Number of Fisher Scoring iterations: 5

[3 points] Q28.

Examine the output from the previous chunk and run a new regression with (only) the top 3 predictors of the target variable, and store the result in glmTop3. Then, print the summary of glmTop3. Rubric: 1 point for each of the top 3 predictors.

top3_predictors <- c("creditGrade", "incomeAnnual", "employmentYears_missing")

# Creating the formula for the new model
formula_top3 <- as.formula(paste("isLoanDefault ~", paste(top3_predictors, collapse = " + ")))

# Running the new regression with only the top 3 predictors
glmTop3 <- glm(formula_top3, family = "binomial", data = loanTrain)

# Printing the summary of glmTop3
summary(glmTop3)
## 
## Call:
## glm(formula = formula_top3, family = "binomial", data = loanTrain)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.488e+00  6.796e-02 -36.616  < 2e-16 ***
## creditGradeB             6.429e-01  6.768e-02   9.499  < 2e-16 ***
## creditGradeC             1.063e+00  7.030e-02  15.116  < 2e-16 ***
## creditGradeD             1.273e+00  7.834e-02  16.255  < 2e-16 ***
## creditGradeE             1.447e+00  1.169e-01  12.385  < 2e-16 ***
## creditGradeF             1.929e+00  2.047e-01   9.419  < 2e-16 ***
## creditGradeG             2.085e+00  3.662e-01   5.694 1.24e-08 ***
## incomeAnnual            -5.426e-06  6.820e-07  -7.957 1.77e-15 ***
## employmentYears_missing  7.770e-01  1.112e-01   6.990 2.74e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13468  on 19393  degrees of freedom
## Residual deviance: 12912  on 19385  degrees of freedom
## AIC: 12930
## 
## Number of Fisher Scoring iterations: 5

[5 points] Q29.

Based on the AIC scores which one of the following would you choose for modeling and why: glmBase, glmTop5, glmTop3?

### This section doesn't require code but feel free to reprint any critical values.
#glmBase: AIC = 12920
#glmTop5: AIC = 12916
#glmTop3: AIC = 12930
#I'll choose the model with the lowest AIC, which is glmTop5 (AIC = 12916). Therefore, based on the AIC scores, glmTop5 would be the preferred model, as it indicates a better fit to the data (penalized for complexity) compared to the other models.

Answer: I’ll choose glmTop5 because it has the lowest AIC (12916) of the three models.
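A compact way to confirm this comparison (a minimal sketch, assuming glmBase, glmTop5, and glmTop3 are all still in memory):

# AIC values of the three candidate models, side by side
AIC(glmBase, glmTop5, glmTop3)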

[5 points] Q30.

Knit to html after eliminating all the errors. Submit both the Rmd and html files. Tip: Do not worry about minor formatting issues. Tip: This will take some time as you are processing medium-sized data sets.

### This section doesn't require code. Just knit and submit the Rmd and html files.###