Let us load a file containing the results of a survey conducted in Spain:
sp <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/spanish_data.csv")
Variables:
ideolog
: respondents’s position on the ideological spectrum, 1 – right, 0 – left;
age
: respondent’s age;
ident_reg
: equals 1 if a respondent indentifies themselves with a region (province in Spain);
ident_cntr
: equals 1 if a respondent indentifies themselves with a country (Spain as a whole political unit);
male
: equals 1 if a respondent is male, 0 – female;
educ
: respondent’s level of education (1 – primary school, 4 – higher education);
unempl
: equals 1 if a respondent is unemployed, 0 – otherwise.
Suppose we want to check which factors affect the probability of being right at the ideological scale. Our dependent variable in this case is ideolog
and it is binary. So, a linear model is not applicable here. We need a logistic regression that is used for such tasks.
For running a logistic model we use the function glm()
(from generalized linear models that are not all linear, though) and specify the family binomial
(logistic will be taken by default):
logmodel <- glm(data = sp, ideolog ~ age + ident_reg + unempl, family = binomial)
As usual, we can look at the summary for the model:
summary(logmodel)
##
## Call:
## glm(formula = ideolog ~ age + ident_reg + unempl, family = binomial,
## data = sp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9855 -0.8979 -0.7489 1.4289 2.1202
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.947292 0.300301 -3.154 0.00161 **
## age 0.005488 0.005601 0.980 0.32715
## ident_reg -0.903744 0.317277 -2.848 0.00439 **
## unempl -0.427696 0.236743 -1.807 0.07083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 726.13 on 599 degrees of freedom
## Residual deviance: 710.22 on 596 degrees of freedom
## AIC: 718.22
##
## Number of Fisher Scoring iterations: 4
However, as we discussed, the interpretation of coefficients in logistic models is different from linear ones. Judging by this output we can only say which factors significantly affect the propensity of being right and state the character of relationships (positive or negative effects).
Here we have two significant coefficients (of ident_reg
and of unempl
) that are negative. Thus, we can conclude that
Unemployed people are less prone to be right.
People who identify themselves with a region (province) are less prone to be right.
However, to get more interesting substantial interpretation we should tranform coefficients in this model using an exponent:
exp(coef(logmodel))
## (Intercept) age ident_reg unempl
## 0.3877898 1.0055032 0.4050501 0.6520099
What have we got here are exponents of model coefficients or odd ratios. Now we can say the following:
All else equal, the chances that a person is right are 0.65 higher for unemployed people. Or, more naturally, the chances that a person is right are 1/0.65=1.54 lower for unemployed people.
All else equal, the chances that a person is right are 0.405 higher for people who identify themselves with a region (province). Or, more naturally, the chances that a person is right are 1/0.405=2.47 lower for people who identify themselves with a region (province).
Although the coefficient of age is insignificant, let us interpret it as well as an example of a numeric variable. All else equal, the chances that a person is right increase 1.0055 times when age increases by one year.
As we discussed, it is useless to calculate and interpret R-squared for a logistic model since the method of estimation is no longer OLS (ordinary least squares). For logistic models we usually plot a ROC curve (** curve) and compute the area under the curve (AUC). The closer is AUC to 1, the better the quality of a logisitc model is.
Firstly, install the library pROC
for a basic ROC curve:
install.packages("pROC")
Secondly, load it:
library(pROC)
Now we can create an object with this curve that already includes AUC:
roc <- roc(sp$ideolog ~ logmodel$fitted.values)
roc
##
## Call:
## roc.formula(formula = sp$ideolog ~ logmodel$fitted.values)
##
## Data: logmodel$fitted.values in 424 controls (sp$ideolog 0) < 176 cases (sp$ideolog 1).
## Area under the curve: 0.5946
Comments: sp$ideolog
is our binary dependent variable and logmodel$fitted.values
are values of probabilities predicted by our model.
The area under curve (AUC) is 0.59 that is not perfect, but still acceptable, greater than the baseline 0.5.
Let us plot this ROC curve and add a text comment with the value of AUC:
plot(roc)
text(x = 1, y = 0.8, "AUC = 0.59")
That is all.