When we perform the simple linear regression, we do not put any constraint on the response variable. That is, if we are trying to predict \(y\) based on continuous \(x\) with the model \(\hat{y}_i=\hat\beta_0+\hat\beta_1x_i\), then our repsonse \(y\) is continuous within its range and can take any value. Figure 1 shows an example. Blue dots shows the original observations. Red line is a fitted regression line and the gray lines are \(95\%\) confidence band. Here \(x\) is defined in between \(20\) and \(80\). And fitted \(y\) is limited in between \(47.93\) and \(164.48\). Although capped by upper and lower limit, here fitted \(y\) (i.e \(\hat{y}\)) can take any value within this specific range.
Figure 1: Simple linear regression example
Now here the nature of the response does not require any constraint. But this is not the case when we are trying to predict a binary outcome. As an example of the previous case, think about the problem that we are trying to predict a person’s height based on weight (The numbers used above will not be realistic though!) Both height and weight are continuous in nature. But the problem would be a little more complicated if we trying to predict person’s gender based on the height.
hdata
contains the height, weight, age and gender of some adult people. The original data is found here. Here we are using a sample of the data, for ages in between \(20\) and \(60\).
Figure 2: Height boxplot across males and females
Now let us have a look at the boxplots of the heights and weights across males and females, which is shown in Figure 2 and 3. From visual inspection there is clearly a difference in the height and weight across males and females. T test provides statisitcal evidence of this statement. For both height and weight, t.test provided small p-value.
Figure 3: Weight boxplot across males and females
t.test(subset(hdata, Gender=="Male")$height,subset(hdata, Gender=="Female")$height)
##
## Welch Two Sample t-test
##
## data: subset(hdata, Gender == "Male")$height and subset(hdata, Gender == "Female")$height
## t = 17.578, df = 267.69, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 9.958241 12.470433
## sample estimates:
## mean of x mean of y
## 161.1471 149.9328
t.test(hdata$weight~hdata$Gender) ## alternate code
##
## Welch Two Sample t-test
##
## data: hdata$weight by hdata$Gender
## t = -11.374, df = 282.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.199609 -5.780272
## sample estimates:
## mean in group Female mean in group Male
## 42.50496 49.49490
T test actually is performing a hyothesis testing. The null hypothesis here is the mean across male and female are equal. But from the data we have enough eivdence to reject the null, so the mean across male and female are different. P-value is the significance level. It can be thougt of as \(1-confidence\ level\). Lower it is, higher the evidence of rejecting the null.
But t test assumes a normal distribution of the data. One can look at the histograms and conclude that normal approximation holds good for the two quantities.
Now back to original question. Given this data, can we predict a person’s gender if the person’s height and weight is known? Gender is not a numeric quantiy that we can make it a continuous function of height and/or weight. From the boxplots and the results of t test, it is evident that higher the heigth or weight, it is more liklely for the person to be a guy and vice versa. The primary challenge is how to model a binary response giving some functional form?
For logistic regression, we play with the probabilities. The binary outcomes (male or female) are translated into probability measures, so that at least we can think of a functional form. Now back to Figure 1, we can model the response as a function of predictor. Here the response is continuous and unbounded, that is if \(x\) is spans throughout the real number line, so does \(y\). But this will not work in the bnary outcome case as we are dealing with probability. Instead of the fitted curve in Figure 1, we want a fitted curve something like in Figrue 4.
Figure 4: Logistic Fuction
Now if we model the probability of being male as a function of height, we expect a Figure like Figrue 4. The plot is made taking height 155 at center. Some remarks about the curve:
So, we are almost there. We just want to make a curve like in Figure 5. But how do we do it? The answer is logistic function. It has the floowing functional form:
\[Pr(being\ male)=y=logistic(x)=\frac{1}{1+e^{-x}}\] The above equation has the similar shape as in Figue 5, only for the cnter location. The above is the standard form, centered at zero. The equiton did not come up at random. The properties suggested above requires a Cumulative Probability Density Funciton (CDF) of a random variable defined for all real numbers. The above equation is actually the CDF of standard logistic distribution. To know more about this distribution, see here.
The standard logistic is centered at zero. Actually this is the optimal value, where two groups are separated. But in our example, if we think about height, the optimal separating value would be around \(155\). How to change this? And actually this value will be determined from the data. To accomodate these things, we actually rely on the following equation:
\[Pr(being\ male)= y=logistic(x)=\frac{1}{1+e^{-(\beta_0+\beta_1x)}}\]
We can re-write the eqaution as follows:
\[log\{\frac{Pr(being\ male)}{1-Pr(being\ male)}\} = \beta_0+\beta_1x\]
The term inside log of the left hand side of abive is calles odd. Basically we are here modeling the log odd as a linear combination of the predictors.
Now we will build a logistic regression model for our data, to see whether we can predct a person’s gender given height and weight. We will partition the data into training and testing.
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
hdata$Gender=as.factor(hdata$Gender)
print(contrasts(hdata$Gender))
## Male
## Female 0
## Male 1
trainindex=createDataPartition(hdata$Gender, p=0.6)$Resample1
## Create the trainig and testing set
train=hdata[trainindex,]
test=hdata[-trainindex,]
## Make a model
logisticmodel1=glm(Gender~height, data=train, family=binomial)
print(summary(logisticmodel1))
##
## Call:
## glm(formula = Gender ~ height, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2958 -0.4609 -0.1217 0.4800 3.3723
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.06288 8.28669 -6.765 1.33e-11 ***
## height 0.36063 0.05331 6.765 1.33e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 238.23 on 171 degrees of freedom
## Residual deviance: 119.26 on 170 degrees of freedom
## AIC: 123.26
##
## Number of Fisher Scoring iterations: 6
To work with the model correctly, we first coverted the Gender variable as a factor. When converted, the reference is given to the Female group. So the model will model the probabilty of being male. This reference can be altered by using the relevel
funciton.
Note: Using family="binomial"
in glm does the logistic regression. By default, it chooses logit link. The logit link is for the assumption of logistic distribution of the underlying distribution. One can also choose standard normal. This link is called probit link and can be specified.
Within the training set, we fiited a logistic model. The summary of the model gave a number of information. Let us go over these things one by one:
The functional form of the model. Here we are just passing height as a predictor. So one coefficient (\(\beta_1\)) for this predictor will be estimated. Be default, the intercept (\(\beta_0\)) will be added in the model. So, here we will have two coefficeints to be estimated. The estimattion is performed using a technique called MLE (stands for Maximum Likelihood Estimation).
Refer back to Figure 1, the fitted regression line and the original observations are different. The difference (\(original-fitted\)) is called residual. Similar things happen in the logistic regression. Strictly writing, the model actually has the following form:
\[log\{\frac{Pr(being\ male_i)}{1-Pr(being\ male_i)}\} = \beta_0+\beta_1x_i+\epsilon_i\] Residuals are estiamtes of these \(\epsilon\)’s. Deviance residuals provide some stats of the residuals, namely minimum
, \(1^{st}\) quantile, median, \(3^{rd}\) quantile and the maximum.
A bunch of statistics about the model predictors. In this case, we have two coefficients, one for the intercept and one for the height variable.
The first column gives estimate of the coefficients. We do not know the true values of \(\beta_0\) or \(\beta_1\). But by MLE, these were estimated, which are written with a hat symbol. Here: \[\hat\beta_0=-51.74431\ \text{and } \hat\beta_1= 0.33213\]
Interpretation: The interpretation of the coefficents is a little tricky here. Here \(exp(\hat\beta_0)=3.37079\times 10^{-23}\approx 0\). So, if \(x=0\) (if the height is zero), the odd of being male is almost zero. For this example this is not very practical though. Taking a realistic value, assume that we got \(\beta_0=-0.4\) for some case. For this case, \(exp(-0.4)=0.67\). So the probabilty of being male is (\((1-0.67)\times 100\%=33\%\)) less than the probability of being female, when our predictor \(x\) is set to zero.
For \(\beta_1\), the exponetial value is \(exp(0.33)=1.39\). So, with a unit increase in height, the odd of being a male
increases by \(39\%\).
As being said, \(\hat\beta_0\) and \(\hat\beta_1\) are the estimates of true \(\beta_0\) and \(\beta_1\). In reality, we do not know the true value. Our estimate is not a static number. It has a sampling distribution. That is, this value can range within some specific limit. And probability (probability density, strictly speaking) of taking one sprecific value within the range can be different than the probability of taking other value. This is described by the pdf (probability density function) of our estiamtes.
Recall that we had \(\epsilon\) in our model, which is introducing the variablity of our estimates. If \(\epsilon\) is modeled as normal with zero mean, then it turns out that our estimates have normal disribution as well, based on large sample theory. The Std. Error
associated with the estimates is the Std. Error of the sampling distribution of that estimate (which is normal).
This is the associated z-score
of our estimate, where our Null Hypothesis is the coefficient value is zero. For the intercept, our z score is \(-6.626\). That is, our estimate is \(6.626\) standard deviation away from zero. Now, for a normal distrbution centered at zero, what is the likelihood of getting a value at least this much away.
Figure 5 shows a standard normal curve. The are under the bell curve is \(1\). The shaded area is the probability of being the value in between \(-2\) and \(-6\). This probability is fairly small. Now how would be the area under curve from \(-6.626\) to all the way up to \(-\infty\)? It would be very small, no surprize. Nunerically we can get this probabilty by pnorm(-6.626)
which gives us \(1.724526\times 10^{-11}\). For two sided (\(6.626\) standard deviation away in either direction) probability can be achieved by \(2\times1.724526\times 10^{-11}=3.45\times10^{-11}\). Yes, this is the probability that our estimate of \(\beta_0\) is NOT different from zero.
Figure 5: Standard Normal Curve
Pr(>|z|)
(p-value)This is a magic number for us. This is the probability that our estimate is NOT different from zero. This is obtained from the z-score. The process is described in the preceding part. The strict definition of the p-value is the probability of getting a value as extreme as the estimate, when the Null is true
. Our null hypothesis states that the estimates are zero. Getting a low p-value means we have strong evidence against our null. So for low p-value, we reject the null hypothesis and recognize the predictor as significant to that level. If a p-value is less than \(0.05\), then our significance level is at most \(0.05\) or \(5\%\), our confidence level is at least \(0.95\) or \(95\%\). If a p-value is \(0.01\), then our confidence level of treating that predictor as significant is \(99\%\).
Earlier it was stated that we estimated the parameters by MLE. Residual deviance is related to the likelihood of the model: \[Residual\ deviance=-2log(likelihood\ of\ the\ model)\].
Null deviance is the similar quantity, when no predictor is included in the model.
The degree of freedom can be calcluated by \(Sample\ Size- Number\ of\ parameters\ estimated\). In our case we had dim(train)[1]=172
observations, and we estimated two parameters (\(\beta_0\) and \(\beta_1\)). So the dgree of freedom is \(172-2=170\). For null deviance, we dropped all variables from the model anb just one parameter is to be estiamated (\(\beta_0\)). So the degree of freedom for null deviance is \(171\).
AIC is a metric that is used to make comparison among different models. For now we just have one model, so this is not very handy. But is we happen to have two or more competiting models, then AIC can be used to pick up one. By definition: \(AIC=2g-2\log(L)\), where \(g\) is the number of parameter to be estimated and \(L\) is the likelihood of the model. Since \(log\) is a monotonic increasing function and there is a negative sign before the \(log(L)\), lower AIC indicates better fit.
For simple linear regression, we have a lack of fit test. An equivalent goodness of fit test for losigtic regression is Hosmer-Lemeshow test
. The logistic regression gives us probabilities. These probabilities are divided into several subgroups (\(10\) is a good number) and makes a comparison how many actual observaitons we have in those subgroups. The null hypothesis here is that we have a good fit of the model. Lower p-value indicates lack of fit. Following codes chunk is performing a Hosmer-Lemeshow test for our model with \(10\) groups. We have a high p-value; so not much evidence that we have lack of fit.
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 3.4.3
## ResourceSelection 0.3-2 2017-02-28
binaryCodes=ifelse(train$Gender=="Male",1,0)
predValues=predict(logisticmodel1, newdata=train, type="response")
hoslem.test(binaryCodes,predValues, g=10)$p.value
## [1] 0.8876258
The second argument of hoslem.test
function is the fitted probabilities. To get the actual probabilities, we have to specify type="response"
. Leaving it default will give the log of odd ratio, that is fitted \(\hat\beta_0+\hat\beta_1x\) values.
Note: The fitted probabilities are also available in the glm object, that is in logisticmodel1. predValues=logisticmodel1$fitted.values
Once we make a model, naturally we want to be interested how our model is performing. Here, in this problem, we made a model that is supposed to predict the gender of a person given the person’s height. The simpliest metric is the model accuracy. Let us do that:
fittedSamplesTrain=predict(logisticmodel1, newdata=train, type="response")
fittedGenderTrain=ifelse(fittedSamplesTrain<0.5,"Female","Male" )
ConfMatTrain=table(train$Gender,fittedGenderTrain)
TrainAccuracy=(ConfMatTrain[1,1]+ConfMatTrain[2,2])/sum(ConfMatTrain)
TrainTPR=72/(72+11)
TrainFPR=10/(10+79)
print(cbind(TrainAccuracy, TrainTPR, TrainFPR))
## TrainAccuracy TrainTPR TrainFPR
## [1,] 0.8546512 0.8674699 0.1123596
Above we derived three statistics related to the performance of our model on the training data. The three statistics are
overall accuracy
true positive rate (TPR)
false positive rate (FPR)
We want the first two measures to be as high as possible and the last one as low as possible. All measures can be calculated from the contingency table
or confusion matrix
. Our model has an overall accuracy of \(87.8\%\). This is the accuracy rate when the model is predicting someone as Male or Female. Besides the overall accuracy, two other measures might be of interest based on the scenario, True positive rate (TPR) and false positive rate (FPR). These can be expressed in Sensitivity and Specificity as well, where the relaions are very simple: \[Sensitivity=true\ positive\ rate\] \[Specificity=1-false\ postive\ rate=true\ negative\ rate\] In our example, TPR is \(86.75\%\). That is, the model is able to pick up \(86.75\%\) of the males (since our target is male here) correctly. Our FPR is \(12.04\%\). That is \(11.23\%\) of all the females were detected to be males, falsely.
The above statics are for our training data. We can make such model that could be work perfectly in our training set, but the performance would drastically be deteriorated in the test set. This can happen when we increase the complexity of the model, that tries hard to fit the training data. But what we really want, that our model performance should not deviate seriously in the test or hold out data. Let us have a look how good we are with our model in the test set:
fittedSamplesTest=predict(logisticmodel1, newdata=test, type="response")
fittedGenderTest=ifelse(fittedSamplesTest<0.5,"Female","Male" )
ConfMatTest=table(test$Gender,fittedGenderTest)
TestAccuracy=(ConfMatTest[1,1]+ConfMatTest[2,2])/sum(ConfMatTest)
TestTPR=45/(45+10)
TestFPR=1-52/(52+7) ## expressed as 1-true negative rate
print(cbind(TestAccuracy, TestTPR, TestFPR))
## TestAccuracy TestTPR TestFPR
## [1,] 0.8596491 0.8181818 0.1186441
We see that our model performance is a little deviated in the test data. We got slighly lower accuracy, lower true postive rate and slightly higher false positive rate. But still, these are not dramatically different than those obtained for the train set. So, our model is doing pretty good.
So far, we played around a single model. That is, we made a model that predicts gender based on height. How about making a model that uses weight as predictor rather than height? Or both? Does it improve our prediction? There are more than one metric that can be used as a comparison purpose among several models. For example, information criterion, error rate (\(1-accuracy\)), area under ROC curve, KS statistic etc. For notes on these topics please visit here.