In this part, we import the data from spss file into R, and convert our categorical variables into factors; in addition, we create a graphical summary of the response variable anaema.
library(aod)
library(ggplot2)
library(Hmisc)
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:aod':
##
## rats
##
## Loading required package: Formula
## Hmisc library by Frank E Harrell Jr
##
## Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview')
## to see overall documentation.
##
##
## Attaching package: 'Hmisc'
##
## The following object is masked from 'package:survival':
##
## untangle.specials
##
## The following object is masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(lattice)
mydata <- spss.get("/Users/Sofia/Desktop/anaemia.sav") #import .sav data
## Loading required package: foreign
## Warning: duplicated levels in factors are deprecated
View(mydata)
summary(mydata)
## Warning: duplicated levels in factors are deprecated
## age region educ ceb
## 15-19:139 Central :200 no education:173 0 :134
## 20-24:263 Eastern :210 primary :495 1 :144
## 25-29:193 Northern :108 secondary :129 2-3:220
## 30-34:112 Western :242 2-3: 0
## 35-39: 65 Not de jure resident: 37 4-5:161
## 40-44: 23 4-5: 0
## 45-49: 2 6+ :138
## wt anaema
## Min. : 37.7 not anaemic:493
## 1st Qu.: 52.0 anaemic :304
## Median : 56.8
## Mean : 57.6
## 3rd Qu.: 62.1
## Max. :152.6
##
### Convert to factor
mydata$age <- factor(mydata$age)
mydata$region <- factor(mydata$region)
mydata$edu <- factor(mydata$educ)
mydata$ceb <- factor(mydata$ceb)
## Warning: duplicated levels in factors are deprecated
mydata$anaema <- factor(mydata$anaema)
weight <- as.numeric(mydata$wt)
xtabs(~anaema + factor(age), data = mydata)
## factor(age)
## anaema 15-19 20-24 25-29 30-34 35-39 40-44 45-49
## not anaemic 83 169 117 64 43 16 1
## anaemic 56 94 76 48 22 7 1
xtabs(~anaema + factor(region), data = mydata)
## factor(region)
## anaema Central Eastern Northern Western Not de jure resident
## not anaemic 130 117 63 167 16
## anaemic 70 93 45 75 21
xtabs(~anaema + factor(educ), data = mydata)
## factor(educ)
## anaema no education primary secondary
## not anaemic 95 302 96
## anaemic 78 193 33
xtabs(~anaema + factor(ceb), data = mydata)
## factor(ceb)
## anaema 0 1 2-3 4-5 6+
## not anaemic 81 93 134 102 83
## anaemic 53 51 86 59 55
Plots we created for as graphical summaries of our variables
barchart(mydata$anaema, horizontal = F)
histogram(weight)
bwplot(anaema ~ wt, data = mydata)
bwplot(anaema ~ wt | age, data = mydata)
bwplot(anaema ~ wt | educ, data = mydata)
bwplot(anaema ~ wt | region, data = mydata)
bwplot(anaema ~ wt | ceb, data = mydata)
Analysis
mylogit <- glm(anaema ~ wt + educ + region + ceb + age, family = "binomial",
data = mydata)
summary(mylogit)
##
## Call:
## glm(formula = anaema ~ wt + educ + region + ceb + age, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.452 -0.993 -0.796 1.274 1.892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.74649 0.59686 1.25 0.21105
## wt -0.01266 0.00942 -1.34 0.17899
## educprimary -0.29834 0.18961 -1.57 0.11562
## educsecondary -0.98489 0.28527 -3.45 0.00056 ***
## regionEastern 0.30273 0.20979 1.44 0.14903
## regionNorthern 0.07350 0.25802 0.28 0.77574
## regionWestern -0.38131 0.21521 -1.77 0.07642 .
## regionNot de jure resident 0.70770 0.37521 1.89 0.05928 .
## ceb1 -0.26528 0.27517 -0.96 0.33502
## ceb2-3 -0.24200 0.30079 -0.80 0.42108
## ceb4-5 -0.61067 0.36654 -1.67 0.09571 .
## ceb6+ -0.34540 0.43203 -0.80 0.42400
## age20-24 -0.00945 0.26770 -0.04 0.97183
## age25-29 0.25480 0.33171 0.77 0.44241
## age30-34 0.46875 0.40075 1.17 0.24213
## age35-39 -0.11237 0.47101 -0.24 0.81144
## age40-44 -0.23305 0.61901 -0.38 0.70656
## age45-49 0.66134 1.50672 0.44 0.66071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1059.6 on 796 degrees of freedom
## Residual deviance: 1021.7 on 779 degrees of freedom
## AIC: 1058
##
## Number of Fisher Scoring iterations: 4
1 - pchisq(1021.7, 779) ###Likelihood ratio test p-value
## [1] 9.418e-09
1- in the output above, the first thing we see is the call, this is R reminding us what the model we ran was, what options we specified, etc.
2- Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit. 3- The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values.The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable. We can see that education is highly signficant, and ceb is significant at 10%.
4- For a one unit increase in wt, the log odds of being anaemic decreases by 0.013
5- The indicator variables for education, region, ceb, and age have a slightly different interpretation. For example, having attended a secondary school, versus no education, changes the log odds of admission by -0.985.
anova(mylogit, test = "Chisq") ##weight, education and region seem to be significant,
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: anaema
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 796 1060
## wt 1 5.05 795 1055 0.0247 *
## educ 2 10.46 793 1044 0.0054 **
## region 4 14.86 789 1029 0.0050 **
## ceb 4 2.11 785 1027 0.7159
## age 6 5.45 779 1022 0.4881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(mylogit, test = "Chisq") ## we can drop wt, ceb, and age!!
## Single term deletions
##
## Model:
## anaema ~ wt + educ + region + ceb + age
## Df Deviance AIC LRT Pr(>Chi)
## <none> 1022 1058
## wt 1 1024 1058 1.88 0.1704
## educ 2 1034 1066 12.64 0.0018 **
## region 4 1038 1066 16.35 0.0026 **
## ceb 4 1025 1053 3.56 0.4687
## age 6 1027 1051 5.45 0.4881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(mylogit, ~.^2, test = "Chisq")
## Single term additions
##
## Model:
## anaema ~ wt + educ + region + ceb + age
## Df Deviance AIC LRT Pr(>Chi)
## <none> 1022 1058
## wt:educ 2 1019 1059 2.74 0.25
## wt:region 4 1019 1063 2.38 0.67
## wt:ceb 4 1018 1062 4.12 0.39
## wt:age 6 1016 1064 5.63 0.47
## educ:region 8 1012 1064 9.70 0.29
## educ:ceb 8 1018 1070 4.14 0.84
## educ:age 10 1009 1065 12.86 0.23
## region:ceb 16 1010 1078 11.99 0.74
## region:age 20 1007 1083 14.94 0.78
## ceb:age 13 1005 1067 16.31 0.23
search <- step(mylogit, ~.^2)
## Start: AIC=1058
## anaema ~ wt + educ + region + ceb + age
##
## Df Deviance AIC
## - age 6 1027 1051
## - ceb 4 1025 1053
## - wt 1 1024 1058
## <none> 1022 1058
## + wt:educ 2 1019 1059
## + wt:ceb 4 1018 1062
## + wt:region 4 1019 1063
## + educ:region 8 1012 1064
## + wt:age 6 1016 1064
## + educ:age 10 1009 1065
## - region 4 1038 1066
## - educ 2 1034 1066
## + ceb:age 13 1005 1067
## + educ:ceb 8 1018 1070
## + region:ceb 16 1010 1078
## + region:age 20 1007 1083
##
## Step: AIC=1051
## anaema ~ wt + educ + region + ceb
##
## Df Deviance AIC
## - ceb 4 1029 1045
## - wt 1 1029 1051
## <none> 1027 1051
## + wt:educ 2 1024 1052
## + wt:ceb 4 1023 1055
## + wt:region 4 1025 1057
## + age 6 1022 1058
## - region 4 1042 1058
## + educ:region 8 1018 1058
## - educ 2 1039 1059
## + educ:ceb 8 1023 1063
## + region:ceb 16 1015 1071
##
## Step: AIC=1045
## anaema ~ wt + educ + region
##
## Df Deviance AIC
## <none> 1029 1045
## - wt 1 1031 1045
## + wt:educ 2 1026 1046
## + wt:region 4 1027 1051
## + ceb 4 1027 1051
## - region 4 1044 1052
## - educ 2 1040 1052
## + educ:region 8 1020 1052
## + age 6 1025 1053
newlogit <- (glm(anaema ~ wt + educ + region, family = "binomial", data = mydata))
summary(newlogit)
##
## Call:
## glm(formula = anaema ~ wt + educ + region, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.431 -0.997 -0.823 1.273 1.954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5302 0.5622 0.94 0.3457
## wt -0.0130 0.0093 -1.40 0.1623
## educprimary -0.2740 0.1843 -1.49 0.1370
## educsecondary -0.8612 0.2675 -3.22 0.0013 **
## regionEastern 0.2920 0.2075 1.41 0.1594
## regionNorthern 0.0772 0.2551 0.30 0.7621
## regionWestern -0.3246 0.2107 -1.54 0.1234
## regionNot de jure resident 0.7534 0.3682 2.05 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1059.6 on 796 degrees of freedom
## Residual deviance: 1029.3 on 789 degrees of freedom
## AIC: 1045
##
## Number of Fisher Scoring iterations: 4
anova(newlogit, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: anaema
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 796 1060
## wt 1 5.05 795 1055 0.0247 *
## educ 2 10.46 793 1044 0.0054 **
## region 4 14.86 789 1029 0.0050 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 - pchisq(1029.3, 789) ###Likelihood ratio test p-value
## [1] 1.497e-08
add1(newlogit, ~.^2, test = "Chisq") ##no need to add intercation effects to the model.
## Single term additions
##
## Model:
## anaema ~ wt + educ + region
## Df Deviance AIC LRT Pr(>Chi)
## <none> 1029 1045
## wt:educ 2 1026 1046 3.29 0.19
## wt:region 4 1027 1051 2.44 0.65
## educ:region 8 1020 1052 8.85 0.36
mylogit <- (glm(anaema ~ wt + educ + region, family = "binomial", data = mydata))
summary(mylogit)
##
## Call:
## glm(formula = anaema ~ wt + educ + region, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.431 -0.997 -0.823 1.273 1.954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5302 0.5622 0.94 0.3457
## wt -0.0130 0.0093 -1.40 0.1623
## educprimary -0.2740 0.1843 -1.49 0.1370
## educsecondary -0.8612 0.2675 -3.22 0.0013 **
## regionEastern 0.2920 0.2075 1.41 0.1594
## regionNorthern 0.0772 0.2551 0.30 0.7621
## regionWestern -0.3246 0.2107 -1.54 0.1234
## regionNot de jure resident 0.7534 0.3682 2.05 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1059.6 on 796 degrees of freedom
## Residual deviance: 1029.3 on 789 degrees of freedom
## AIC: 1045
##
## Number of Fisher Scoring iterations: 4
In the output above, we found ceb and age not to be significant, so we remove them from the model; also, we checked to see if interations between variables is significant, and didn't find any significane!
We can obtain the confidence intervals for the coefficient variables. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function. We can also obtain confidence intervals based on just the standard errors by using the default method.
## CIs using profiled log-likelihood
confint(mylogit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.54848 1.65825
## wt -0.03175 0.00475
## educprimary -0.63515 0.08811
## educsecondary -1.39365 -0.34303
## regionEastern -0.11407 0.70023
## regionNorthern -0.42544 0.57599
## regionWestern -0.73871 0.08815
## regionNot de jure resident 0.03558 1.48757
## CIs using standard errors
confint.default(mylogit)
## 2.5 % 97.5 %
## (Intercept) -0.57178 1.632100
## wt -0.03122 0.005232
## educprimary -0.63526 0.087188
## educsecondary -1.38557 -0.336920
## regionEastern -0.11472 0.698768
## regionNorthern -0.42272 0.577151
## regionWestern -0.73762 0.088314
## regionNot de jure resident 0.03176 1.475026
We can test for an overall effect of educ using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. We use the wald.test function. b supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms, finally Terms tells R which terms in the model are to be tested.
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 3:4)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 10.4, df = 2, P(> X2) = 0.0054
The chi-squared test statistic of 12.1, with two degrees of freedom is associated with a p-value of 0.0024 indicating that the overall effect of education is statistically significant.
We can use the wald test on the remaining categorical variables.
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 5:8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 14.6, df = 4, P(> X2) = 0.0057
We can also test additional hypotheses about the differences in the coefficients for the different levels of education. Below we test that the coefficient for primary education is equal to the coefficient for secondary education. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for educ=primary and educ= secondary (i.e., the 3rd and 4th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
l <- cbind(0, 0, 1, -1, 0, 0, 0, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.6, df = 1, P(> X2) = 0.01
The chi-squared test statistic of 6.6 with 1 degree of freedom is associated with a p-value of 0.01, indicating that the difference between the coefficient for primary education and the coefficient for secondary education is statistically significant.
You can also exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and it is part of mylogit (coef(mylogit)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
## odds ratios only
exp(coef(mylogit))
## (Intercept) wt
## 1.6992 0.9871
## educprimary educsecondary
## 0.7603 0.4226
## regionEastern regionNorthern
## 1.3391 1.0803
## regionWestern regionNot de jure resident
## 0.7228 2.1242
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.6992 0.5778 5.2501
## wt 0.9871 0.9687 1.0048
## educprimary 0.7603 0.5299 1.0921
## educsecondary 0.4226 0.2482 0.7096
## regionEastern 1.3391 0.8922 2.0142
## regionNorthern 1.0803 0.6535 1.7789
## regionWestern 0.7228 0.4777 1.0922
## regionNot de jure resident 2.1242 1.0362 4.4263
Now we can say that for a one unit increase in wt, the odds of being anaemic (versus not being anaemic) increase by a factor of 0.987 . Note that while R produces it, the odds ratio for the intercept is not generally interpreted.
Checking assumptions of logistic regression
Generalized linear models are freed from the assumption that residuals are normally distributed with equal variance, but the method nevertheless makes important assumptions that should be checked.
Residual plots and other diagnostic plots for glm objects might looks strange and have “stripes” because the data are discrete. Residual plots can nevertheless help to spot severe outliers. Leverage plots can help determine whether any outlying data points exert excessive influence on the parameter estimates. The normal quantile plots are probably not of much use.
plot(mylogit) # deviance residuals, q-q, leverage
Deviance residuals indicate the contribution of each point to the total deviance, which measures overall fit of the model to data. Use jitter to reduce overlap of points in the plot.
plot(predict(mylogit), jitter(residuals(mylogit), amount = 0.05))
Pearson residuals are calculated from the working data on the logit or log scale, but correct for differences between data points in their associated “weight” (inverse of expected variance).
plot(predict(mylogit), jitter(residuals(mylogit, type = "pearson"), amount = 0.05))
Leverage indicates the influence that each data point has on the parameters of the fitted model. A data point has high leverage if removing it from the analysis changes the parameter estimates substantially.
plot(hatvalues(mylogit), jitter(residuals(mylogit, type = "pearson"), amount = 0.05))
Logistic regression assume that the “dispersion parameter” is 1. What this means is that the variance of the residuals are as expected from the binomial or Poisson distributions specified by the link function. In the case of count data, for example, a dispersion parameter of 1 means that the variance of the explanatory variable within each group should be approximately equal to the mean for the group. However, count data are notorious for having higher than expected variances (dispersion parameter > 1). One way to check is to tabulate the mean and variance for each group and see how they compare. Another way is to estimate the dispersion parameter directly by refitting the data to a quasi-likelihood model that includes such a parameter. For binary and count data (logistic) the corresponding quasi-likelihood models are as follows.
mylogit1 <- glm(anaema ~ wt + educ + region, family = quasibinomial(link = "logit"),
data = mydata)
summary(mylogit1)
##
## Call:
## glm(formula = anaema ~ wt + educ + region, family = quasibinomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.431 -0.997 -0.823 1.273 1.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53016 0.56523 0.94 0.3486
## wt -0.01299 0.00935 -1.39 0.1649
## educprimary -0.27403 0.18529 -1.48 0.1395
## educsecondary -0.86124 0.26895 -3.20 0.0014 **
## regionEastern 0.29202 0.20864 1.40 0.1620
## regionNorthern 0.07721 0.25644 0.30 0.7634
## regionWestern -0.32465 0.21183 -1.53 0.1258
## regionNot de jure resident 0.75339 0.37016 2.04 0.0422 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.011)
##
## Null deviance: 1059.6 on 796 degrees of freedom
## Residual deviance: 1029.3 on 789 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
If the estimated dispersion parameter is much greater than 1 (in our case it's 1), then you should repeat your entire analysis using the quasibinomial link function. This will lead to more reliable standard errors for parameters, and different P-values in tests.
Model selection
Compute log-likelihoods for fitted model.
logLik(mylogit) # log-likelihood of the model fit
## 'log Lik.' -514.6 (df=8)
stepAIC
stepAIC is a command in the MASS library that will automatically carry out a restricted search for the “best” model, as measured by either AIC or BIC (Bayesian Information criterion) (but not AICc, unfortunately). stepAIC is accepts both categorical and numerical variables. It does not search all possible subsets of variables, but rather it searches in a specific sequence determined by the order of the variables in the formula of the model object.
library(MASS)
mylogit2 <- stepAIC(mylogit, upper = ~., lower = ~1, direction = "both")
## Start: AIC=1045
## anaema ~ wt + educ + region
##
## Df Deviance AIC
## <none> 1029 1045
## - wt 1 1031 1045
## - region 4 1044 1052
## - educ 2 1040 1052
summary(mylogit2)
##
## Call:
## glm(formula = anaema ~ wt + educ + region, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.431 -0.997 -0.823 1.273 1.954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5302 0.5622 0.94 0.3457
## wt -0.0130 0.0093 -1.40 0.1623
## educprimary -0.2740 0.1843 -1.49 0.1370
## educsecondary -0.8612 0.2675 -3.22 0.0013 **
## regionEastern 0.2920 0.2075 1.41 0.1594
## regionNorthern 0.0772 0.2551 0.30 0.7621
## regionWestern -0.3246 0.2107 -1.54 0.1234
## regionNot de jure resident 0.7534 0.3682 2.05 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1059.6 on 796 degrees of freedom
## Residual deviance: 1029.3 on 789 degrees of freedom
## AIC: 1045
##
## Number of Fisher Scoring iterations: 4
plot(mylogit2)
anova(mylogit2)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: anaema
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 796 1060
## wt 1 5.05 795 1055
## educ 2 10.46 793 1044
## region 4 14.86 789 1029
Now, to interpret the output of stepAIC. Start at the top and follow the sequence of steps that the procedure has carried out. First, it fits the full model you gave it and prints the AIC value. Then it lists the AIC values that result when it drops out each term, one at a time, leaving all other variables in the model. (It starts with the highest-order interactions, if you have included any.) It picks the best model of the bunch tested (the one with the lowest AIC) and then starts again. At each iteration it may also add variables one at a time to see if AIC is lowered further. The process continues until neither adding nor removing any single term results in a lower AIC.