Week 2: Class 07/02/2003


Today we’ll surf on a new statistical model: logistic regression


When the output is a binary variable

Let us consider we want to predict “the League” of a player conditional to its hits. The variable “league” is discrete:

Is a factor with levels A (American League) and N (National League) indicating player’s league at the end of 1986


Example

Try to plot the league versus the hits. What can you observe?

library(ISLR)
attach(Hitters)


plot(Hitters$Hits, Hitters$League, main="Scatterplot Example",
   xlab="Hits", ylab="League", pch=19)

boxplot(Hitters$Hits~Hitters$League)


We can see, for instance, that if \(Hits>200\) is more likely that the League is number 1 (American), whereas if \(Hits<50\) seems more likely to be in league 2 (National). In between (i.e, \(50<Hits<200\)) we expect a transition to league 2 to league 1 as the number of hits increase.

Also, we can grasp, by looking at the following scatter plot, that the distribution of the number of hits are not so different whether the player is in League 1 (A) or in League 2 (N). We can say that we don’t expect the number of hits to have a great predictive power on the league of the player

Let’s keep researching on it.


What can be the appropriate research question could we make?

  • Has the number of hits predictive power for the type of league of a player?

In the case we have a binary output variable, we need to introduce a new function called “logistic”:

\[ y=f(x)=\frac{1}{1+e^{-x}} \]


Indeed, we can generalize the function adding some parameters:

\[ y=f(x)=\frac{1}{1+e^{-(a+b\times x)}} \]

Note the importance of the “minus” sign in the exponential.

Exercise: Generate a sequence of values for \(x\) (let’s say \(x\in[-20,20]\)) and write the following logistic function.

\[ y_{i}=\frac{1}{1+e^{-0.5x}} \]

Plot it and try to explain what you get.

x<-seq(-20,20,1)
y<-1/(1+exp(-0.5*x))
plot(x,y)

  • We can see that the logistic curve has two limits: 0 and 1, so:

We can interpret the output of this function as a probability


Note that, if you change the sign of the slope \(b\) the orientation of the logistic also changes:

x<-seq(-20,20,1)
y<-1/(1+exp(0.5*x))
plot(x,y)

In our case, the model we want to estimate is written in this way:

\[ League_{i}=\frac{1}{1+e^{-(\beta_{0}+\beta_{1}hits)}} \]

as you can check, we change the \(x\) or \(a+b\times x\) by a function of hits. We estimate an appropriate model, called logistic regression, to analyze the impact of the hits to the probability of playing in league A or N.

attach(Hitters)
## The following objects are masked from Hitters (pos = 3):
## 
##     Assists, AtBat, CAtBat, CHits, CHmRun, CRBI, CRuns, CWalks,
##     Division, Errors, Hits, HmRun, League, NewLeague, PutOuts, RBI,
##     Runs, Salary, Walks, Years
options(warn=-1)
model <- glm( League ~Hits, data = Hitters, family = binomial)
summary(model)
## 
## Call:
## glm(formula = League ~ Hits, family = binomial, data = Hitters)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2903  -1.1035  -0.9818   1.2149   1.4686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.270376   0.269686   1.003    0.316  
## Hits        -0.004422   0.002449  -1.805    0.071 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 443.95  on 321  degrees of freedom
## Residual deviance: 440.64  on 320  degrees of freedom
## AIC: 444.64
## 
## Number of Fisher Scoring iterations: 4

The coefficients of this model cannot be interpreted as in a linear regression. The are inside a non linear function (the logistic). But, firstly, they’re helpful to see how likely is for a player to be in League 1 or 2. Using the above estimated parameters:

x<-seq(0,200,1) #we create an admissible set of "number of hits"
y<-1/(1+exp(-(0.27-0.004422*x)))
plot(x,y)

as you can see, the model struggles classifying clearly a player. Indeed, if we want to see the logistic curve, we need to do something similar to:

x<-seq(-500,500,1) #we create an admissible set of "number of hits"
y<-1/(1+exp(-(0.27-0.004422*x)))
plot(x,y)

But is impossible for a player to get negative hits. Our conclusion is that, as we saw in the descriptive initial analysis, the number of hits seems to have no predictive power on the league of a player.

Now, think about the interpretation of the parameters: is this interpretation standard?

This model will be useful to forecast PROBABILITY. The output of the model, as you have drawn before is a value between 0 and 1. So it transforms the number of hits of a player to the probability of being in the league labelled with value 1 (which is league A).

Lesson of the day: we have learnt to choose a model that matches with the kind of variables we have!

Week 2: Class 09/02/2003

Today’s question:

Use the data set called “Default”


Which variable is a good predictor for default in a bank: the annual incomes or the monthly credit card balance?


For answering the question, try again to plot both quantitative variables against the qualitative variable \(default\) defined as:

\[ Default=\begin{cases} 1 & \mathrm{the\:customer\:is\:in\:default}\\ 0 & \mathrm{otherwise} \end{cases} \]

library(ISLR)
attach(Default)

boxplot(balance~default)

As we can see, the distribution of the credit card balance is different for people that is considered default against people that is not defaulting. Indeed, the credit card balance seems to be a good predictor.

However, the incomes doesn’t look like a good predictor of the variable default.

boxplot(income~default)

Now, let us to fit a logistic regression model trying to predict default firstly, with balance and then, with income.

  default_n <-ifelse(Default$default == "Yes",1,0) #is usual to create a new numerical dichotomous variable
  new_data<-data.frame(Default,default_n)
  model1_log<-glm(default_n~income, family="binomial", data=new_data)
  summary(model1_log)
## 
## Call:
## glm(formula = default_n ~ income, family = "binomial", data = new_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2968  -0.2723  -0.2576  -0.2478   2.7111  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.094e+00  1.463e-01 -21.156   <2e-16 ***
## income      -8.353e-06  4.207e-06  -1.985   0.0471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 2916.7  on 9998  degrees of freedom
## AIC: 2920.7
## 
## Number of Fisher Scoring iterations: 6
  • Question one: is the variable income significative?

If you see the p-value: 0.0471, you reject-at 5% significance level- the null hypothesis that the true parameter equals zero:

  • Question two: is a good predictive variable?

We can compute the logistic curve:

  #Data frame with hp income in ascending order
  new_d <- data.frame(income=seq(min(income), max(income),len=length(income)))
  
  # Fill predicted values using the regression model called model1_log
  prob <- predict(model1_log, new_d, type="response")
  
  # Plot Predicted data and original data points
  
  income_v<-as.numeric(unlist(new_d))
  par(mfrow=c(1,2))
  plot(income, default_n)
  plot(income_v,prob, lwd=2, col="green")

It seems that, even the variable “income” is significant, it doesn’t mean that is a good predictive variable. As we guessed with the previous descriptive analysis. Let’s try with the other variable: balance.

  model2_log<-glm(default_n~balance, family="binomial", data=new_data)
  summary(model2_log)
## 
## Call:
## glm(formula = default_n ~ balance, family = "binomial", data = new_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8

The variable “balance” is has an associated positive coefficient. That means: “the higher the current account balance, the higher the probability of default”. Remember that the coefficient here is interpreted as a weight in the probability of the target variable to get value 1.

  #Data frame with balance in ascending order
  new_i <- data.frame(balance=seq(min(balance), max(balance),len=length(balance)))
  
  # Fill predicted values using regression model
  prob <- predict(model2_log, new_i, type="response")
  
  # Plot Predicted data and original data points
  
  balance_v<-as.numeric(unlist(new_i))
  
  par(mfrow=c(1,2))
  plot(balance, default_n)
  plot(balance_v,prob, lwd=2, col="green")