Visit my website for more like this!
Heavily borrowed from:
Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
UCLA Example link
require(knitr)
## Loading required package: knitr
Linear regression (tutorial here) assumes that the response variable Y is quantitative. However, in many situations we are dealing with qualitative response variables, which we will generally refer to as categorical variables. For example, eye color is categorical since it takes values like brown, blue, and green. Classification involves assigning an observation to a specific class of a categorical variable; usually we do this by predicting the probability that the observation belongs to each class.
There are many classification techniques, or classifiers, that could be used to predict a qualitative response variable. This notebook covers logistic regression. Later notebooks (link here) will cover more complicated classifiers.
Just like linear regression, in classification we have a set of training observations that we use to build a classifier, and we then evaluate the model on test data to estimate its out-of-sample error. In this notebook we will use a dataset of credit card information to predict whether an individual will default on their credit card payment.
# Load the textbook R package
require(ISLR)
# Load in the credit data
attach(Default)
# Let's take a look at the data
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
# How many people actually default?
tmp <- table(default)
(tmp[[2]]/tmp[[1]])*100
## [1] 3.445
We can see that these data have 10000 observations of 4 variables, and that only about 3% of people actually default. Let's create a few diagnostic plots to get a sense of the data. Remember, the goal here will be to predict whether someone will default on their credit card payment, based on the variables student, balance, and income.
library(ggplot2); library(gridExtra)
## Loading required package: grid
x <- qplot(x=balance, y=income, color=default, shape=default, geom='point')+scale_shape(solid=FALSE)
y <- qplot(x=default, y=balance, fill=default, geom='boxplot')+guides(fill=FALSE)
z <- qplot(x=default, y=income, fill=default, geom='boxplot')+guides(fill=FALSE)
# Plot
x
grid.arrange(y, z, nrow=1)
We see that the relationship between the predictor balance and the default rate is rather pronounced; this will likely be a major predictor in the forthcoming model.
In the Default dataset, the response variable falls into two categories (yes or no). Rather than modeling the response Y directly, logistic regression models the probability that Y belongs to a particular category. Therefore, in our case with the Default data, logistic regression models the probability of defaulting. For example, the probability of default given balance can be written as
\[Pr(default = Yes|balance)\]
where the values range from 0 to 1. Then, for any given value of balance, a prediction can be made for default. For example, we could predict default = Yes for any person whose predicted probability of defaulting is > 0.5. We can also use these probabilities in other ways: a company that wishes to be conservative in flagging individuals at risk of default could choose a lower probability threshold, say > 0.1.
Logistic regression uses the logistic function fitted by maximum likelihood. The logistic function will always produce an S-shaped curve, so regardless of the value of X we will always return a sensible prediction. To interpret the model we can rearrange the equation so that it returns the odds. Odds are traditionally used instead of probabilities in horse racing, since they relate more naturally to the correct betting strategy. Taking the log of the equation simplifies it further, and now it looks similar to what we have seen in linear regression.
\[\log\left(\frac{p(X)}{1-p(X)}\right) = β0 + β1X\]
The left-hand side is now the log-odds, or logit. In linear regression, a one-unit increase in X changes the expected response Y by \(β1\); in logistic regression, increasing X by one unit changes the log odds by \(β1\). Since the probability curve is not a straight line, the change in probability associated with a one-unit change in X depends on the current value of X. However, if \(β1\) is positive, then increasing X will result in an increase in probability, and vice-versa.
The coefficients \(β0\) and \(β1\) are initially unknown, and must be estimated from the available training data. Recall that linear regression uses the least squares approach to estimate the coefficients. For logistic regression, the more general method of maximum likelihood is preferred for its robust statistical properties. Intuitively, the algorithm seeks coefficients such that the predicted probability of default is close to 1 for individuals who defaulted and close to 0 for individuals who did not.
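For reference, the standard textbook form behind these statements (general notation, not specific to this dataset) is the logistic function
\[p(X) = \frac{e^{β0 + β1X}}{1 + e^{β0 + β1X}}\]
and maximum likelihood chooses \(β0\) and \(β1\) to maximize the likelihood
\[\ell(β0, β1) = \prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} \left(1 - p(x_{i'})\right)\]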
Let’s fit a basic model to these values and inspect the results.
logit <- glm(default ~ balance, data=Default, family='binomial')
summary(logit)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.270 -0.146 -0.059 -0.022 3.759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.65133 0.36116 -29.5 <2e-16 ***
## balance 0.00550 0.00022 24.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600
##
## Number of Fisher Scoring iterations: 8
Here we see that \(β1\) is 0.0055; this means that balance is associated with an increased probability of default. Specifically, a one-unit increase in balance increases the log odds of default by 0.0055 units.
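To see what this means on the probability scale, we can plug a couple of illustrative balance values into the fitted model (the values 1000 and 2000 below are arbitrary examples, not from the original analysis):
# Predicted probability of default at two illustrative balances
predict(logit, newdata=data.frame(balance=c(1000, 2000)), type='response')
With the coefficients above these come out to roughly 0.6% and 59%, showing how strongly the probability of default depends on balance.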
Other components of the logistic regression output are similar to linear regression. We can use the standard errors to measure the accuracy of the coefficient estimates, and the z-statistics and p-values to assess whether each predictor has a statistically significant association with the response.
For qualitative predictors, we can take the same dummy variable approach as in linear regression.
# Create a new dummy variable for students
Default$studentD <- 0
Default$studentD[Default$student=="Yes"] <- 1
logit <- glm(default ~ studentD, data=Default, family='binomial')
summary(logit)
##
## Call:
## glm(formula = default ~ studentD, family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.297 -0.297 -0.243 -0.243 2.659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5041 0.0707 -49.55 < 2e-16 ***
## studentD 0.4049 0.1150 3.52 0.00043 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 2908.7 on 9998 degrees of freedom
## AIC: 2913
##
## Number of Fisher Scoring iterations: 6
From these diagnostics we see that the model predicts that students tend to have higher default probabilities than non-students.
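We can translate the positive studentD coefficient onto the probability scale by predicting for a non-student (studentD = 0) and a student (studentD = 1); this is a quick sketch rather than part of the original analysis:
# Predicted default probability for a non-student vs a student
predict(logit, newdata=data.frame(studentD=c(0, 1)), type='response')
With this fit the predicted default probabilities are roughly 2.9% for non-students and 4.3% for students.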
logit <- glm(default ~ income + balance + studentD, family='binomial', data=Default)
summary(logit)
##
## Call:
## glm(formula = default ~ income + balance + studentD, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.469 -0.142 -0.056 -0.020 3.738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.09e+01 4.92e-01 -22.08 <2e-16 ***
## income 3.03e-06 8.20e-06 0.37 0.7115
## balance 5.74e-03 2.32e-04 24.74 <2e-16 ***
## studentD -6.47e-01 2.36e-01 -2.74 0.0062 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1580
##
## Number of Fisher Scoring iterations: 8
There is a very interesting result here. Recall that the bivariate logistic regression between default and student predicted that students are more likely to default than non-students. However, when student is paired with balance and income, students appear less likely to default than non-students. What is going on here? For fixed values of balance and income, a student is less likely to default than a non-student; in fact, the student default rate is below the non-student default rate at every value of balance. Yet if we ignore balance and average over all individuals, the overall student default rate is higher than the non-student rate, which is why the bivariate regression produces a positive coefficient. The reason is that the features student and balance are correlated: students tend to hold higher levels of debt, and higher balances are associated with a higher probability of defaulting. So a student is more likely to default than a non-student if no credit card balance information is available, but students are actually less risky than non-students with the same credit card balance. This phenomenon is known as confounding.
In this dataset we are interested in building a predictive model to see how variables like GRE (Graduate Record Exam) scores, GPA, and the prestige of the individual's undergraduate institution affect graduate school admission. The response variable Y is admit / not-admit.
# Load the data
data <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
str(data)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
# Some basic statistics
summary(data)
## admit gre gpa rank
## Min. :0.000 Min. :220 Min. :2.26 Min. :1.00
## 1st Qu.:0.000 1st Qu.:520 1st Qu.:3.13 1st Qu.:2.00
## Median :0.000 Median :580 Median :3.40 Median :2.00
## Mean :0.318 Mean :588 Mean :3.39 Mean :2.48
## 3rd Qu.:1.000 3rd Qu.:660 3rd Qu.:3.67 3rd Qu.:3.00
## Max. :1.000 Max. :800 Max. :4.00 Max. :4.00
sapply(data, sd) # standard deviation
## admit gre gpa rank
## 0.4661 115.5165 0.3806 0.9445
# Two-way contingency table, to ensure no 0 cells
xtabs(~admit + rank, data=data)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
Next we will have to clean the data a little: the rank feature is stored as an integer, but we need to treat it as a categorical factor.
data$rank <- factor(data$rank)
# Fit the first logit model
logit <- glm(admit ~ gre + gpa + rank, data=data, family = 'binomial')
summary(logit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.627 -0.866 -0.639 1.149 2.079
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.98998 1.13995 -3.50 0.00047 ***
## gre 0.00226 0.00109 2.07 0.03847 *
## gpa 0.80404 0.33182 2.42 0.01539 *
## rank2 -0.67544 0.31649 -2.13 0.03283 *
## rank3 -1.34020 0.34531 -3.88 0.00010 ***
## rank4 -1.55146 0.41783 -3.71 0.00020 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.5
##
## Number of Fisher Scoring iterations: 4
Excellent, these are rather strong-looking results. We see that all the variables are significant. Using the coefficients we can begin to interpret the meaning of the results. R outputs these as log odds.
- For every one-unit increase in gre, the log odds of admission increases by 0.0022.
- For every one-unit increase in gpa, the log odds of being admitted increases by 0.804.
- The rank variables have a slightly different interpretation: attending a rank 2 institution, versus a rank 1 institution, changes the log odds of admission by -0.675. Recall that with dummy variables we are treating rank 1 as the baseline, so all the visible rank coefficients are in reference to rank 1.
Next we can use the confint function to obtain confidence intervals for the coefficient estimates. For logistic regression, these are based on the profiled log-likelihood function. We can also get confidence intervals based on the standard errors.
confint(logit) # Using LL
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202 -1.792547
## gre 0.0001376 0.004436
## gpa 0.1602959 1.464143
## rank2 -1.3008888 -0.056746
## rank3 -2.0276713 -0.670372
## rank4 -2.4000265 -0.753543
confint.default(logit) # Using standard errors
## 2.5 % 97.5 %
## (Intercept) -6.2242419 -1.755716
## gre 0.0001202 0.004409
## gpa 0.1536837 1.454391
## rank2 -1.2957513 -0.055135
## rank3 -2.0169921 -0.663416
## rank4 -2.3703986 -0.732529
Since the rank variable is qualitative and is split into individual dummy levels, we can also test the overall effect of rank in the model using the wald.test function from the aod library.
library(aod)
wald.test(b = coef(logit), Sigma = vcov(logit), Terms=4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
As expected, the effect of rank is statistically significant. We can further this analysis by testing whether there are significant differences between levels 2-4 of the rank variable. This is not present in the existing model, since we are comparing all levels of rank to the baseline, rank 1. Below we test whether the coefficient for rank=2 is equal to the coefficient for rank=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1 and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
l <- cbind(0,0,0,1,-1,0)
wald.test(b=coef(logit), Sigma=vcov(logit), L=l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.5, df = 1, P(> X2) = 0.019
Indeed, there is a significant difference between rank 2 and 3.
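To put a number on that difference, we can look at the gap between the two coefficients on the log-odds scale (a quick follow-up, not from the original analysis):
# Difference in the log odds of admission between rank 2 and rank 3 schools
coef(logit)["rank2"] - coef(logit)["rank3"]
Exponentiating this difference would give the odds ratio comparing rank 2 directly to rank 3, which ties into the odds ratios discussed next.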
Currently, the coefficients are reported as log odds, which are difficult to interpret in a meaningful way. To remedy this, we can exponentiate the coefficients and interpret them as odds ratios.
# Odds ratios
exp(coef(logit))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185 1.0023 2.2345 0.5089 0.2618 0.2119
# Odds ratio and 95% CI
exp(cbind(OR = coef(logit), confint(logit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0185 0.001889 0.1665
## gre 1.0023 1.000138 1.0044
## gpa 2.2345 1.173858 4.3238
## rank2 0.5089 0.272290 0.9448
## rank3 0.2618 0.131642 0.5115
## rank4 0.2119 0.090716 0.4707
Now we can say more relevant things like: for every one-unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.
We can also use predicted probabilities to understand the model. First we will calculate the predicted probability of admission at each value of rank, holding gre and gpa at their means.
# Create new dataframe
ranktest <- with(data, data.frame(gre=mean(gre), gpa=mean(gpa), rank=factor(1:4)))
ranktest
## gre gpa rank
## 1 587.7 3.39 1
## 2 587.7 3.39 2
## 3 587.7 3.39 3
## 4 587.7 3.39 4
ranktest$rankP <- predict(logit, newdata=ranktest, type='response')
ranktest
## gre gpa rank rankP
## 1 587.7 3.39 1 0.5166
## 2 587.7 3.39 2 0.3523
## 3 587.7 3.39 3 0.2186
## 4 587.7 3.39 4 0.1847
Very cool. Now we can see very clearly that, holding everything else constant, a student from a rank 1 school has a 52% chance of being admitted, while students from the lowest-ranked schools have only an 18% chance of admittance.
We can also extend this same procedure to a range of gre values, rather than just the mean. Here we test the effect of gre scores ranging from 200 to 800 at each rank.
tmp <- with(data, data.frame(gre=rep(seq(200, 800, length.out=100), 4), gpa=mean(gpa), rank = factor(rep(1:4, each = 100))))
# Get probabilities and standard errors (for plotting)
probs <- cbind(tmp, predict(logit, newdata=tmp, type='link', se=TRUE))
# Get a look at this new data
head(probs)
## gre gpa rank fit se.fit residual.scale
## 1 200.0 3.39 1 -0.8115 0.5148 1
## 2 206.1 3.39 1 -0.7978 0.5091 1
## 3 212.1 3.39 1 -0.7840 0.5034 1
## 4 218.2 3.39 1 -0.7703 0.4978 1
## 5 224.2 3.39 1 -0.7566 0.4922 1
## 6 230.3 3.39 1 -0.7429 0.4866 1
probs <- within(probs, {
PredProb <- plogis(fit) # convert log odds (link scale) to predicted probability
LL <- plogis(fit - (1.96 * se.fit)) # lower bound of 95% confidence interval
UL <- plogis(fit + (1.96 * se.fit)) # upper bound of 95% confidence interval
})
Graph the output…
ggplot(probs, aes(x=gre, y=PredProb)) +
geom_ribbon(aes(ymin=LL, ymax=UL, fill=rank), alpha=0.2) +
geom_line(aes(color=rank))
Coming later…
This data can be found in the ISLR package. The dataset contains percentage returns for the S&P 500 stock index over 1,250 days from 2001 to 2005. For each date there are percentage returns recorded for each of the five previous trading days, Lag1 - Lag5. Also recorded are Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the given date), and Direction (whether the market was Up or Down on the given date).
library(ISLR)
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
# Correlation minus the Direction variable
cor(Smarket[,-9])
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## Year 1.00000 0.029700 0.030596 0.033195 0.035689 0.029788 0.53901
## Lag1 0.02970 1.000000 -0.026294 -0.010803 -0.002986 -0.005675 0.04091
## Lag2 0.03060 -0.026294 1.000000 -0.025897 -0.010854 -0.003558 -0.04338
## Lag3 0.03319 -0.010803 -0.025897 1.000000 -0.024051 -0.018808 -0.04182
## Lag4 0.03569 -0.002986 -0.010854 -0.024051 1.000000 -0.027084 -0.04841
## Lag5 0.02979 -0.005675 -0.003558 -0.018808 -0.027084 1.000000 -0.02200
## Volume 0.53901 0.040910 -0.043383 -0.041824 -0.048414 -0.022002 1.00000
## Today 0.03010 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860 0.01459
## Today
## Year 0.030095
## Lag1 -0.026155
## Lag2 -0.010250
## Lag3 -0.002448
## Lag4 -0.006900
## Lag5 -0.034860
## Volume 0.014592
## Today 1.000000
From this output we can see there is no real correlation between the percentage returns Today and any of the Lag variables, but there is a notable correlation between Year and Volume: trading volume is increasing over time.
attach(Smarket)
plot(Volume)
Next, let's fit a logistic regression on the lag variables using the glm() function with family='binomial'. First we split the data into a training set (years before 2005) and a test set (2005).
# Split data into testing and training
train<-Smarket[Year<2005,]
test<-Smarket[Year==2005,]
Here we fit a logistic model to the stock market data, first using the built-in R functions and then using the caret package.
library(caret)
## Loading required package: lattice
logit <- glm(Direction ~ Lag1+Lag2+Lag3, family='binomial', data=train)
summary(logit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.34 -1.19 1.07 1.16 1.34
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03223 0.06338 0.51 0.61
## Lag1 -0.05552 0.05171 -1.07 0.28
## Lag2 -0.04430 0.05167 -0.86 0.39
## Lag3 0.00881 0.05150 0.17 0.86
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 994 degrees of freedom
## AIC: 1389
##
## Number of Fisher Scoring iterations: 3
# Run the model on the test set
test.probs <-predict(logit, test, type='response')
pred.logit <- rep('Down',length(test.probs))
pred.logit[test.probs>=0.5] <- 'Up'
table(pred.logit, test$Direction)
##
## pred.logit Down Up
## Down 39 31
## Up 72 110
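From this table we can also compute the overall test-set accuracy directly; this is just a cross-check and should match the Accuracy value reported by confusionMatrix() below.
# Proportion of test days classified correctly (roughly 0.59 here)
mean(pred.logit == test$Direction)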
confusionMatrix(test$Direction, pred.logit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 39 72
## Up 31 110
##
## Accuracy : 0.591
## 95% CI : (0.528, 0.653)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.137
## Mcnemar's Test P-Value : 8.1e-05
##
## Sensitivity : 0.557
## Specificity : 0.604
## Pos Pred Value : 0.351
## Neg Pred Value : 0.780
## Prevalence : 0.278
## Detection Rate : 0.155
## Detection Prevalence : 0.440
## Balanced Accuracy : 0.581
##
## 'Positive' Class : Down
##
modelFit<- train(Direction~Lag1+Lag2+Lag3, method='glm',preProcess=c('scale', 'center'), data=train, family=binomial(link='logit'))
summary(modelFit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.34 -1.19 1.07 1.16 1.34
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0321 0.0634 0.51 0.61
## Lag1 -0.0683 0.0636 -1.07 0.28
## Lag2 -0.0545 0.0635 -0.86 0.39
## Lag3 0.0109 0.0635 0.17 0.86
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 994 degrees of freedom
## AIC: 1389
##
## Number of Fisher Scoring iterations: 3
confusionMatrix(test$Direction, predict(modelFit, test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 39 72
## Up 31 110
##
## Accuracy : 0.591
## 95% CI : (0.528, 0.653)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.137
## Mcnemar's Test P-Value : 8.1e-05
##
## Sensitivity : 0.557
## Specificity : 0.604
## Pos Pred Value : 0.351
## Neg Pred Value : 0.780
## Prevalence : 0.278
## Detection Rate : 0.155
## Detection Prevalence : 0.440
## Balanced Accuracy : 0.581
##
## 'Positive' Class : Down
##
As you can see, we obtain about 59% accuracy overall, which isn't very good. We do catch about 78% of the actual 'Up' days, but of the days the model predicts 'Up', only about 60% actually turn out to be up, so we should be cautious about trading on an 'Up' prediction. The models are not particularly strong because lagged stock market returns usually aren't good indicators of future performance.
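We can verify that last figure directly from the predictions and the test labels (a quick check, not from the original text):
# Of the days the model predicts 'Up', what fraction are actually up?
mean(test$Direction[pred.logit == "Up"] == "Up")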
To predict Direction for new values of Lag1 through Lag3, we simply use the predict() function and feed it a data frame of new values.
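For example, a minimal sketch (the lag values below are made up purely for illustration):
# Predicted probability of an 'Up' day for two hypothetical sets of lag values
predict(logit, newdata=data.frame(Lag1=c(1.2, 1.5), Lag2=c(1.1, -0.8), Lag3=c(0.0, 0.5)), type='response')
Since Direction has levels Down and Up, these predicted probabilities correspond to the probability of an 'Up' day.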