2024-04-05

Regression

In statistics, regression is a technique for investigating the relationship between one or more predictor variables (assumed to be known with precision) and an outcome variable (usually measured, and thus assumed to contain some error).

The most common regression technique, often called a “least squares” model, makes several assumptions about the data. If these assumptions are violated, the results may be invalid to varying degrees.

Let’s review those assumptions…

Linear model assumptions

  • Outcome variable is continuous and unbounded/wide-ranging
  • Predictors can be continuous, dichotomous, or categorical (i.e. more than 2 levels)
  • Observations are independent
  • Relationship between predictors and outcome is linear
  • Continuous variables are normally distributed, without extreme outliers
  • Homoscedasticity (constant error variance over the data range)
  • Independence and normality of errors

(Boslaugh S., Statistics in a Nutshell, 2nd ed.)
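
As a hedged illustration (the data and variable names below are hypothetical, not from the example dataset), R’s default diagnostic plots give a quick visual check of several of these assumptions:

# Hypothetical data: a continuous outcome y, a continuous predictor x
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)

# Residuals-vs-fitted checks linearity and homoscedasticity;
# the Q-Q plot checks normality of the errors
par(mfrow = c(2, 2))
plot(fit)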

Linear model assumptions

But what should be done when we wish to predict a dichotomous variable rather than a continuous one?

Examples include: production of a faulty part/device in a factory, presence of a disease, or similar questions answered with a yes or no.

This represents a major violation of the assumptions underlying least squares regression models and would make their results invalid.

Answer: Logistic Regression

Logistic regression

Logistic regression uses the logistic transform: a way of converting a dichotomous outcome to a continuous one by modeling its Bernoulli probability (i.e. the probability of success).

\(p(x) = \frac{1}{1+e^{-(x-\mu)/s}} = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}, \qquad \beta_0 = -\mu/s, \quad \beta_1 = 1/s\)
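
A minimal sketch in R (with illustrative values \(\mu = 0\), \(s = 1\), so \(\beta_0 = 0\) and \(\beta_1 = 1\)) confirming that the two forms of the logistic function are identical:

# Illustrative parameters: mu = 0, s = 1  =>  beta0 = -mu/s = 0, beta1 = 1/s = 1
mu <- 0; s <- 1
beta0 <- -mu / s
beta1 <- 1 / s

x  <- seq(-6, 6, by = 0.1)
p1 <- 1 / (1 + exp(-(x - mu) / s))         # location-scale form
p2 <- 1 / (1 + exp(-(beta0 + beta1 * x)))  # regression form
all.equal(p1, p2)                          # TRUE

plot(x, p1, type = "l", ylab = "p(x)")     # the familiar S-shaped curve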

Logistic regression

Logistic regression then uses the inverse of the logistic function to obtain an equation which resembles what we are used to in least squares regression models. This transform is known as the “logistic unit”, or logit.

The logit is the natural log of the odds: \(\frac{p(x)}{1-p(x)}\)

\(\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \ln(p)-\ln(1-p) = \beta_0 + \beta_1 x\)

The model which is fitted generalizes to:

\(\text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + e\)

when there is more than one predictor variable.
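
In R, the logit and its inverse are built in as qlogis() and plogis(); a quick sketch verifying the identity above for an arbitrary probability:

p <- 0.7

log(p / (1 - p))     # 0.8472979
log(p) - log(1 - p)  # same value
qlogis(p)            # built-in logit; same value again

# plogis() is the inverse (the logistic function), recovering p
plogis(qlogis(p))    # 0.7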

Example

Logistic Regression

Dataset Name: logistic

Summary: Composed of variables thought likely to predict admission to graduate school (fictional data; source: UCLA)

Variables:

  • admit (1=admitted, 0=not admitted)
  • gre (Graduate Record Exam Score)
  • gpa (Grade Point Average, 0-4)
  • rank (Undergraduate school rank 1-4)
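
The dataset can be loaded and prepared along these lines (the URL is an assumption, based on the commonly cited UCLA example data; point it at wherever your copy lives). Converting admit and rank to factors matches the str() output on the next slide:

# URL is an assumption based on the commonly cited UCLA source
logistic <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# admit is a yes/no outcome; rank has 4 levels, so both become factors
logistic$admit <- factor(logistic$admit)
logistic$rank  <- factor(logistic$rank)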

Example Data Set

str(logistic)
## 'data.frame':    400 obs. of  4 variables:
##  $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
summary(logistic)
##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000

Example Data Set - Visualization

[Figure: scatterplot of GRE vs GPA]

Impression: Weak correlation between GPA and GRE score.
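
A sketch of how this figure might be reproduced in base R (the original plot is not shown here):

plot(gre ~ gpa, data = logistic,
     xlab = "GPA", ylab = "GRE score", main = "GRE vs GPA")
abline(lm(gre ~ gpa, data = logistic))  # trend line: weak positive slope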

Example Data Set - Visualization

[Figure: GRE and GPA vs school rank]

Impression: GRE/GPA not much affected by school rank.
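
A sketch of how these panels might be reproduced (again base R; the original figure is not shown):

par(mfrow = c(1, 2))
boxplot(gre ~ rank, data = logistic, xlab = "School rank", ylab = "GRE score")
boxplot(gpa ~ rank, data = logistic, xlab = "School rank", ylab = "GPA")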

Example Data Set - Visualization

[Figure: interactive graph of all variables (big circle = admitted)]

Impression: Unclear if variables predict admission.
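
The interactive figure itself cannot be embedded here; one way such a graph could be built is with the plotly package (an assumption about the original tooling, not a record of it):

library(plotly)

# Marker size encodes admission status: admitted applicants drawn larger
plot_ly(logistic, x = ~gre, y = ~gpa, color = ~rank,
        size = ~ifelse(admit == "1", 2, 1),
        type = "scatter", mode = "markers")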

Example Data Set - Question

Which variables, if any, are predictive of admission? (GRE?, GPA?, School rank?)

logitreg <- glm (admit ~ gre + gpa + rank, data=logistic, 
                 family = "binomial")

We utilize the glm function of R, which stands for “generalized linear model”, and use the parameter family = “binomial” to indicate that we wish to use the logistic/logit link.

Now, let’s look at the results of the analysis…

Logistic regression of Example Dataset

summary(logitreg)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = logistic)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

Odds ratios

Logistic regression is usually interpreted not in terms of the logit itself, but rather in terms of its exponentiated value, the odds ratio.

exp(coef(logitreg))
## (Intercept)         gre         gpa       rank2       rank3       rank4 
##   0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375

This indicates:

  • a 1-point increase in GRE score multiplies the odds of admission by 1.0023

  • a 1-point increase in GPA more than doubles the odds of admission (2.2345 times)

  • graduating from a rank-2 school multiplies the odds of admission by 0.5089 (roughly halving them) compared to graduating from a rank-1 school

  • graduating from rank-3 and rank-4 schools decreases the odds even further.

  • From the previous slide, all the variables were statistically significant predictors of admission.
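
Finally, a hedged sketch of putting the fitted model to use: predicting the admission probability for a hypothetical applicant (the values below are illustrative, not from the source):

# Hypothetical applicant: GRE 620, GPA 3.5, from a rank-2 school
applicant <- data.frame(gre  = 620,
                        gpa  = 3.5,
                        rank = factor(2, levels = 1:4))

# type = "response" returns the probability rather than the logit
predict(logitreg, newdata = applicant, type = "response")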

References