Data Analysis in the Social Sciences

Logistic regression

Let us load a file containing the results of a survey conducted in Spain:

sp <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/spanish_data.csv")

Variables:

ideolog: respondents’s position on the ideological spectrum, 1 – right, 0 – left;
age: respondent’s age;
ident_reg: equals 1 if a respondent indentifies themselves with a region (province in Spain);
ident_cntr: equals 1 if a respondent indentifies themselves with a country (Spain as a whole political unit);
male: equals 1 if a respondent is male, 0 – female;
educ: respondent’s level of education (1 – primary school, 4 – higher education);
unempl: equals 1 if a respondent is unemployed, 0 – otherwise.

Logistic model and its interpretation

Suppose we want to check which factors affect the probability of being right at the ideological scale. Our dependent variable in this case is ideolog and it is binary. So, a linear model is not applicable here. We need a logistic regression that is used for such tasks.

For running a logistic model we use the function glm() (from generalized linear models that are not all linear, though) and specify the family binomial (logistic will be taken by default):

logmodel <- glm(data = sp, ideolog ~ age + ident_reg + unempl, family = binomial)

As usual, we can look at the summary for the model:

summary(logmodel)

## 
## Call:
## glm(formula = ideolog ~ age + ident_reg + unempl, family = binomial, 
##     data = sp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9855  -0.8979  -0.7489   1.4289   2.1202  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.947292   0.300301  -3.154  0.00161 **
## age          0.005488   0.005601   0.980  0.32715   
## ident_reg   -0.903744   0.317277  -2.848  0.00439 **
## unempl      -0.427696   0.236743  -1.807  0.07083 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 726.13  on 599  degrees of freedom
## Residual deviance: 710.22  on 596  degrees of freedom
## AIC: 718.22
## 
## Number of Fisher Scoring iterations: 4

However, as we discussed, the interpretation of coefficients in logistic models is different from linear ones. Judging by this output we can only say which factors significantly affect the propensity of being right and state the character of relationships (positive or negative effects).

Here we have two significant coefficients (of ident_reg and of unempl) that are negative. Thus, we can conclude that

Unemployed people are less prone to be right.
People who identify themselves with a region (province) are less prone to be right.

However, to get more interesting substantial interpretation we should tranform coefficients in this model using an exponent:

exp(coef(logmodel))

## (Intercept)         age   ident_reg      unempl 
##   0.3877898   1.0055032   0.4050501   0.6520099

What have we got here are exponents of model coefficients or odd ratios. Now we can say the following:

All else equal, the chances that a person is right are 0.65 higher for unemployed people. Or, more naturally, the chances that a person is right are 1/0.65=1.54 lower for unemployed people.
All else equal, the chances that a person is right are 0.405 higher for people who identify themselves with a region (province). Or, more naturally, the chances that a person is right are 1/0.405=2.47 lower for people who identify themselves with a region (province).
Although the coefficient of age is insignificant, let us interpret it as well as an example of a numeric variable. All else equal, the chances that a person is right increase 1.0055 times when age increases by one year.

Testing model quality

As we discussed, it is useless to calculate and interpret R-squared for a logistic model since the method of estimation is no longer OLS (ordinary least squares). For logistic models we usually plot a ROC curve (** curve) and compute the area under the curve (AUC). The closer is AUC to 1, the better the quality of a logisitc model is.

Firstly, install the library pROC for a basic ROC curve:

install.packages("pROC")

Secondly, load it:

library(pROC)

Now we can create an object with this curve that already includes AUC:

roc <- roc(sp$ideolog ~ logmodel$fitted.values)
roc

## 
## Call:
## roc.formula(formula = sp$ideolog ~ logmodel$fitted.values)
## 
## Data: logmodel$fitted.values in 424 controls (sp$ideolog 0) < 176 cases (sp$ideolog 1).
## Area under the curve: 0.5946

Comments: sp$ideolog is our binary dependent variable and logmodel$fitted.values are values of probabilities predicted by our model.

The area under curve (AUC) is 0.59 that is not perfect, but still acceptable, greater than the baseline 0.5.

Let us plot this ROC curve and add a text comment with the value of AUC:

plot(roc)
text(x = 1, y = 0.8, "AUC = 0.59")

That is all.

Data Analysis in the Social Sciences

Alla Tambovtseva

Logistic regression

Logistic model and its interpretation

Testing model quality