We will be working with the Default dataset to explore how we can predict credit default using logistic regression.
# Load the packages and data
library(ISLR)    # provides the Default dataset
library(ggplot2) # plotting
library(dplyr)   # mutate()
library(knitr)   # kable()
data(Default)
Let’s start by examining the data:
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
?Default
# Make some plots that you think are helpful in understanding the data
# (don't worry about running regression lines through them, just plot);
# then we will live-code the regression line
# how does income relate to balance?
sp0 <- ggplot(data=Default, aes(x=income, y=balance)) + # add color=student
  geom_point() +
  geom_smooth(method = "lm")
sp1 <- ggplot(data=Default, aes(x=income, y=balance, color=student)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_manual(values=c("#999999", "#E69F00"))
cor(Default$income, Default$balance)
## [1] -0.1522434
model <- lm(balance ~ income, data=Default)
summary(model)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.020449e+03 1.293220e+01 78.90765 0.00000e+00
## income -5.521812e-03 3.585042e-04 -15.40236 6.38044e-53
# Weak negative relationship and correlation.
# Whether or not you are a student doesn't strongly affect the relationship between income and balance.
# But being a student does affect your income (which makes sense: you're not really making money while you're in school!).
sp2 <- ggplot(data=Default, aes(x=balance, y=default, color=student)) +
  geom_point() #+ geom_smooth(method="lm")
# Whether you're a student or not does not seem to affect credit balance.
sp3 <- ggplot(data=Default, aes(x=income, y=default, color=student)) +
  geom_point() #+ geom_smooth(method="lm")
# Again, we see that whether or not you're a student affects your income.
# This next plot tells me nothing:
sp4 <- ggplot(data=Default, aes(x=student, y=default)) +
  geom_point()
# It can only show 4 points: (Yes, Yes), (Yes, No), (No, Yes), (No, No).
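As a quick side check of the income/student comment above (this is not one of the original plots, just a sketch), a boxplot makes the income gap visible:
# Sketch: income distribution by student status, supporting the comment above
ggplot(data=Default, aes(x=student, y=income)) +
  geom_boxplot() + # students have visibly lower incomes
  labs(title="Income by Student Status")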
What do your plots tell you? Are there any interesting observations?
I noticed there were 10,000 observations, where each row is a person, and 4 columns: default, the categorical response variable, which identifies who failed to make a payment on their credit card debt by the due date; student, a categorical predictor variable; and two continuous predictor variables, balance (the average balance the customer has remaining on their credit card after making their monthly payment) and income (the customer's income).
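Since default and student are both two-level factors, a quick cross-tabulation sketch shows the counts behind the four possible combinations (the same reason sp4 can show only four points):
# Sketch: counts for each student/default combination
table(Default$student, Default$default)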
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(sp0, sp1, sp2, sp3, nrow=2) #sp4 tells me nothing
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
In the remainder of this exercise, we will be primarily interested in the probability that an individual defaults based on a handful of characteristics about them. The variable default will be the response, and the others will be covariates.
Before fitting the model in the next code chunk, consider some ways in which your model will work well and some ways in which it won't.
data(Default)
# Convert the response variable from a factor to numeric, recoding levels 1/2 to 0/1
Default_ <- Default # unique(Default$default) # how to check your levels
Default_$default <- as.numeric(Default_$default) - 1
Default$default_yes <- as.numeric(Default$default) - 1
# Fit a linear model with default as a response
# and the other variables as predictors
# mod.lm <- lm(default ~ student + balance + income, data=Default_)
mod.lm <- lm(default_yes ~ student + balance + income, data=Default)
# Review the summary of the model
kable(round(summary(mod.lm)$coefficients,2))
|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | -0.08    | 0.01       | -9.68   | 0.00       |
| studentYes  | -0.01    | 0.01       | -1.82   | 0.07       |
| balance     | 0.00     | 0.00       | 37.41   | 0.00       |
| income      | 0.00     | 0.00       | 1.04    | 0.30       |
# Plot the results of the model, specifically default vs. balance
ggplot(data=Default_, aes(x=balance,y=default)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
How does your model look? Are there some things that you wouldn’t expect?
A probability can never fall below 0, yet the fitted line does for small balances. You can also see that for very large and very small values the likelihood is fairly certain, but there is more uncertainty in the middle.
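We can confirm this directly with a quick sketch checking the range of the linear model's fitted values (probabilities should stay in [0, 1], but these don't):
# Sketch: range of fitted values from the linear model
range(predict(mod.lm)) # the minimum fitted "probability" is negative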
To address the inadequacies of the linear model for this dataset, we try using logistic regression. For this we need a new function in \(\mathsf{R}\) called glm. The format is for the most part the same as lm, but we now need to specify a family. Since our data takes on 0s and 1s, we are making the assumption that the noise is Bernoulli, or more precisely
\[
Y_i \mid X = x \sim \text{Bernoulli}(p_i)
\]
This means we need to specify that the family is binomial. While we won't get into this too much unless time permits, we can also specify a link, which here is link="logit". This tells glm to use the logistic cumulative distribution function (CDF) in the model.
Now try running it yourself!
# Fit the logistic model with the same formula as the linear model
binDef <- glm(default ~ student + balance + income,
family=binomial(link="logit"),
data=Default)
# Review the summary output
kable(round(summary(binDef)$coefficient,2))
|             | Estimate | Std. Error | z value | Pr(>\|z\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | -10.87   | 0.49       | -22.08  | 0.00       |
| studentYes  | -0.65    | 0.24       | -2.74   | 0.01       |
| balance     | 0.01     | 0.00       | 24.74   | 0.00       |
| income      | 0.00     | 0.00       | 0.37    | 0.71       |
How would you interpret these coefficients?
The rounded balance and income coefficients look close to zero, but that is largely an artifact of their units (dollars), since a one-dollar change is tiny. If a coefficient were truly zero, the logit of p would not depend on that covariate at all.
The odds of defaulting on credit card debt are multiplied by \(e^\beta\):
exp(-0.65)
## [1] 0.5220458
as you go from a non-student to a student (studentYes = 1).
The odds of defaulting on credit card debt are multiplied by
exp(.01)
## [1] 1.01005
for every $1 increase in balance (using the rounded coefficient; the confidence interval below shows the unrounded estimate is closer to 0.0057).
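Since the table above is rounded, exponentiating the exact coefficients is a more reliable way to read off the odds multipliers. A small sketch:
# Sketch: exact odds multipliers per unit change in each covariate
exp(coef(binDef))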
confint(binDef, level=0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.185902e+01 -9.928174e+00
## studentYes -1.109018e+00 -1.822147e-01
## balance 5.294898e-03 6.204587e-03
## income -1.304712e-05 1.912447e-05
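These profile intervals are on the log-odds scale; exponentiating the endpoints converts them to intervals for the odds multipliers. A quick sketch using the same binDef model:
# Sketch: confidence intervals on the odds-ratio scale
exp(confint(binDef, level=0.95))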
Remember you are working on probability scales instead of unit scales. For observation \(i\),
\[
p_i = \frac{\exp(\beta_1 x_1 + \beta_2 x_2 + \cdots)}{1 + \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots)}
\]
\[
\mathrm{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)
\]
# Make predictions and plot the result
predDef <- predict(binDef,
type="response") # <-- why do we need this??
predDef2 <- predict(binDef)
max(predDef) # this is the probability
## [1] 0.9776263
max(predDef2) # this is the logit
## [1] 3.777239
logit <- log(.977/(1-.977))      # how to calculate the logit of p
p <- exp(logit)/(1 + exp(logit)) # invert: back from the logit to p
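As a sanity check of the two prediction scales, the response-scale predictions should equal the inverse logit (plogis in R) of the link-scale predictions. A sketch:
# Sketch: response predictions = inverse logit of link predictions
all.equal(as.numeric(predDef), as.numeric(plogis(predDef2))) # should be TRUE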
Default <- Default %>%
mutate(defaultNum = ifelse(default=="Yes",1,0))
Default <- Default %>%
mutate(predDef = predDef)
pBinDef <- ggplot(data=Default, aes(x=balance, y=defaultNum)) +
  geom_point() +
  geom_line(data=Default, aes(x=balance, y=predDef, color=student)) +
  xlab("Balance") +
  ylab("Likelihood of Default") +
  labs(title="Probability of Default given Debt Balance") +
  guides(color=guide_legend("Student"))
pBinDef
Some discussion questions:
Why do we need type="response" in the prediction? The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other quantities such as the logit.
Do you think the model overcomes the difficulties of the lm model? In what ways?
Are there any drawbacks to this model? Income has essentially no effect on the probability of defaulting, so we can try removing it (though the full model would still be important to report).
binDef2 <- glm(default ~ student + balance,
family=binomial(link="logit"),
data=Default)
kable(round(summary(binDef2)$coefficient,2))
|             | Estimate | Std. Error | z value | Pr(>\|z\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | -10.75   | 0.37       | -29.12  | 0          |
| studentYes  | -0.71    | 0.15       | -4.85   | 0          |
| balance     | 0.01     | 0.00       | 24.75   | 0          |
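One way to confirm that dropping income costs essentially nothing is a likelihood-ratio test between the two fitted models; a sketch:
# Sketch: compare the model without income to the full model
anova(binDef2, binDef, test="Chisq")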
End of section 1
In this section, we are going to fit a larger model that looks not only at the main effects of student, balance, and income on default, but also at interaction effects. Start by fitting the larger model that incorporates interaction effects (still using family=binomial and link="logit"). Then compute the difference in deviance between the models (remember, smaller models have larger deviance!). Also compute the difference in residual degrees of freedom, and conduct a hypothesis test.
# Deviance is a goodness-of-fit statistic that generalizes the residual sum of squares;
# it is twice the difference in log-likelihood between the saturated model and the fitted model.
# Fit a larger model that includes all the interactions
# In R, you can do this by replacing z ~ x + y + w with z ~ x * y * w
binDefLarge <- glm(default ~ student*balance*income,
                   family=binomial(link="logit"),
                   data=Default) # the richer model in the deviance comparison
# (A truly saturated model would fit each observation exactly, so it is always
# the most likely given the sample since it is basically the sample itself.)
binDefNull <- glm(default ~ 1,
family=binomial(link="logit"),
data=Default)
binDefB <- glm(default ~ balance,
family=binomial(link="logit"),
data=Default)
# Compute difference in deviances and degrees of freedom
# Hint: Once you fit binDef/binDefLarge, take a look at the available attributes
# for each model by typing binDef$, and R should autocomplete the available options
# If you're stuck, review the slides
diffDev <- binDef$deviance - binDefLarge$deviance
diffDF <- binDef$df.residual - binDefLarge$df.residual
# diffDev <- binDefNull$deviance - binDefB$deviance
# diffDF <- binDefNull$df.residual - binDefB$df.residual
# Difference in degrees of freedom
diffDF
## [1] 4
# Difference in deviances
round(diffDev,3) # small difference in deviance
## [1] 0.586
# Compute p-value
round(1-pchisq(diffDev, diffDF),2) # large p-value, so the smaller model is adequate
## [1] 0.96
# The interaction terms do not significantly improve the fit, so we prefer the simpler model.
# (In general, a large test statistic and small p-value would suggest the larger model fits better,
# i.e., that the smaller model fits more poorly than the larger one.)
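To tie this back to the likelihood definition of deviance mentioned above, the drop in deviance should equal twice the gain in log-likelihood; a quick sketch:
# Sketch: difference in deviance = 2 * difference in log-likelihoods
2 * (as.numeric(logLik(binDefLarge)) - as.numeric(logLik(binDef))) # matches diffDev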
In this last section, we are going to look at model diagnostics, which give us a way to assess the model. The following plots look at the residuals of the response variable as a function of the covariates.
res1 <- ggplot(data=Default, aes(x=income, y=residuals(binDef), color=student)) +
  geom_point() +
  xlab("Income") +
  ylab("Deviance Residuals") +
  labs(title="Deviance Residuals by Income")
res2 <- ggplot(data=Default, aes(x=student, y=residuals(binDef))) +
  geom_boxplot() +
  xlab("Student") +
  ylab("Deviance Residuals") +
  labs(title="Deviance Residuals by Student Type")
res3 <- ggplot(data=Default, aes(x=balance, y=residuals(binDef), color=student)) +
  geom_point() +
  xlab("Balance") +
  ylab("Deviance Residuals") +
  labs(title="Deviance Residuals by Balance")
res1
res2
res3
What do you think these plots suggest about the quality of the model? Do they suggest any major issues with the model?
As balance increases, the deviance residuals shift from positive to negative values. Across the student categories the residuals look similar. For income, there are a lot of negative residuals, which seems to be driving the model's overshooting.
Plot a histogram of the residuals.
resHist <- ggplot(data=NULL, aes(x=residuals(binDef))) +
geom_histogram() +
xlab("Residuals") +
ylab("Counts") +
labs(title="Distribution of Deviance Residuals")
resHist
A residual is the actual (observed) value minus the predicted value. A negative residual means the actual value was LESS than the predicted value: our model,
glm(formula = default ~ student + balance + income, family = binomial(link = "logit"), data = Default),
is overshooting the actual values. So we might want to consider a simpler model and remove income.
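A one-line sketch quantifying how lopsided the residuals are:
# Sketch: fraction of deviance residuals that are negative
mean(residuals(binDef) < 0)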
Construct a QQ Plot of the residuals.
qqnorm(residuals(binDef))
A Q-Q plot is a graphical tool to help us assess whether a set of data plausibly came from some theoretical distribution such as a normal or exponential. Normal Q-Q plots that look like this usually mean the sample data are skewed, which makes sense because we had a lot of No's. (Downward concavity corresponds to negative skewness, a long tail to the left, while upward concavity indicates positive skewness.)
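Adding the theoretical reference line (qqline in base R) makes the departure from normality easier to see; a sketch:
# Sketch: QQ plot with a reference line through the quartiles
qqnorm(residuals(binDef))
qqline(residuals(binDef))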
Run the following code to create a plot of the leverage points.
library(faraway) # provides halfnorm()
halfnorm(hatvalues(binDef)) # half-normal plot of the leverages
head(hatvalues(binDef))
## 1 2 3 4 5 6
## 6.375400e-05 6.575620e-05 2.344373e-04 2.842522e-05 7.491062e-05 1.075051e-04
Based on the previous few plots, has your interpretation of the model quality changed at all? If so, how?
Sorting the data isolates influential points. If an observation has a response value that is very different from the value predicted by the model, that observation is called an outlier. On the other hand, if an observation has a particularly unusual combination of predictor values (e.g., one predictor has a very different value for that observation compared with all the other observations), that observation is said to have high leverage. A point can be both high leverage and influential, but the two concepts are distinct.
High-leverage points are observations made at extreme or outlying values of the independent variables, such that the lack of neighboring observations means the fitted regression model will pass close to that particular observation.
An outlier is a data point whose response y does not follow the general trend of the rest of the data.
A data point has high leverage if it has “extreme” predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be “unusual” combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).
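As a follow-up sketch, sorting the hat values lets us inspect the covariate combinations of the highest-leverage observations:
# Sketch: examine the five highest-leverage observations
topLev <- order(hatvalues(binDef), decreasing=TRUE)[1:5] # indices of the 5 largest leverages
Default[topLev, c("student", "balance", "income")]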