library(Zelig)
## Loading required package: survival
library(car)
library(carData)
##
## Attaching package: 'carData'
## The following objects are masked from 'package:car':
##
## Guyer, UN, Vocab
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(pander)
data(Vocab)
str(Vocab)
## 'data.frame': 30351 obs. of 4 variables:
## $ year : num 1974 1974 1974 1974 1974 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
## $ education : num 14 16 10 10 12 16 17 10 12 11 ...
## $ vocabulary: num 9 9 9 5 8 8 9 5 3 5 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:32115] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..- attr(*, "names")= chr [1:32115] "19720001" "19720002" "19720003" "19720004" ...
head(Vocab)
My data is looking at how number of vocabulary words correct in a 10-word test is effected based on sex and number of years of education completed. I wanted to see whether or not male or females got more words correct then females, as well as if the increase of education led to an increase of number of words correct.
vocabulary2 <- ifelse(Vocab$vocabulary >= 6, 1, 0)
head(vocabulary2)
## [1] 1 1 1 0 1 1
In this model, vocabulary has been recoded so that a 6 or 60% is considering to be passing and anything lower than a 6 is considered to be failing. in a 10 word vocabulary model. 1=passing, 0=failing
logit.vote <- glm(vocabulary2 ~ sex, family = "binomial", data = Vocab)
coef(logit.vote)
## (Intercept) sexMale
## 0.5166771 -0.1095706
slope is -.109
dmeo_df <- Vocab %>%
group_by(sex) %>%
summarise(py1 = mean(vocabulary2)) %>%
mutate(py0 = 1 - py1) %>%
pandoc.table()
##
## --------------------------
## sex py1 py0
## -------- -------- --------
## Female 0.6151 0.3849
##
## Male 0.6151 0.3849
## --------------------------
*Odds: Pw/(1 - Pw) = .6151/(1 - .6151) = 1.598077 Po/(1 - Po) = .6151/(1 - .6151) = 1.598077
Odds Ratio: Pw/(1 - Pw) / Po/(1 - Po) = 1.598077/1.598077 = 1
Logs Odd Ratio log(Pw/(1 - Pw) / Po/(1 - Po)) = log(1.598077/1.598077) = 0*
logit.vote <- glm(vocabulary2 ~ education, family = "binomial", data = Vocab)
coef(logit.vote)
## (Intercept) education
## -3.5064320 0.3109145
slope is .3109
m0 <- glm(vocabulary2 ~ sex, family = binomial, data = Vocab)
summary(m0)
##
## Call:
## glm(formula = vocabulary2 ~ sex, family = binomial, data = Vocab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4032 -1.3545 0.9673 1.0101 1.0101
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.51668 0.01579 32.73 < 2e-16 ***
## sexMale -0.10957 0.02377 -4.61 4.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40453 on 30350 degrees of freedom
## Residual deviance: 40432 on 30349 degrees of freedom
## AIC: 40436
##
## Number of Fisher Scoring iterations: 4
Males are more likely to not score as high on a vocab test.
m1 <- glm(vocabulary2 ~ sex + education, family = binomial, data = Vocab)
summary(m1)
##
## Call:
## glm(formula = vocabulary2 ~ sex + education, family = binomial,
## data = Vocab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3962 -1.1760 0.6173 1.0043 2.7096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44479 0.06694 -51.463 < 2e-16 ***
## sexMale -0.20051 0.02586 -7.753 8.99e-15 ***
## education 0.31286 0.00519 60.284 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40453 on 30350 degrees of freedom
## Residual deviance: 35597 on 30348 degrees of freedom
## AIC: 35603
##
## Number of Fisher Scoring iterations: 4
Males are more likely not to score as high as said in the top model, and as education increase they are more likely to get a high percentage correct
m2 <- glm(vocabulary2 ~ sex*education, family = binomial, data = Vocab)
summary(m2)
##
## Call:
## glm(formula = vocabulary2 ~ sex * education, family = binomial,
## data = Vocab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3844 -1.1785 0.6213 1.0039 2.7304
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.395523 0.090041 -37.711 <2e-16 ***
## sexMale -0.307564 0.134173 -2.292 0.0219 *
## education 0.308913 0.007087 43.591 <2e-16 ***
## sexMale:education 0.008465 0.010410 0.813 0.4161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40453 on 30350 degrees of freedom
## Residual deviance: 35597 on 30347 degrees of freedom
## AIC: 35605
##
## Number of Fisher Scoring iterations: 4
There are a few values above that we can see are statistically significant, meaning that their p values fall below .05.As we can see from the data above, the intercept, sexMale, and education are all three statistically significant. For males logs odd ratio of getting the vocabulary words correct is is -0.3. With education logs odds ratio of having high score on the vocabulary test is 0.3. For men with education it is .008.
lmtest::lrtest(m0, m1, m2)
anova(m0, m1, m2, test = "Chisq")
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:magrittr':
##
## extract
htmlreg(list(m0, m1, m2))
| Model 1 | Model 2 | Model 3 | ||
|---|---|---|---|---|
| (Intercept) | 0.52*** | -3.44*** | -3.40*** | |
| (0.02) | (0.07) | (0.09) | ||
| sexMale | -0.11*** | -0.20*** | -0.31* | |
| (0.02) | (0.03) | (0.13) | ||
| education | 0.31*** | 0.31*** | ||
| (0.01) | (0.01) | |||
| sexMale:education | 0.01 | |||
| (0.01) | ||||
| AIC | 40436.15 | 35603.21 | 35604.55 | |
| BIC | 40452.79 | 35628.17 | 35637.83 | |
| Log Likelihood | -20216.07 | -17798.60 | -17798.27 | |
| Deviance | 40432.15 | 35597.21 | 35596.55 | |
| Num. obs. | 30351 | 30351 | 30351 | |
| p < 0.001, p < 0.01, p < 0.05 | ||||
There is good justification in choosing model 2 as the best fit model for this data set. The interaction model in model 3 is not significant and does not present any more information than model 2. The AIC and BIC in model is the lowest as well. When looking at the liklihood in models 2 and 3, they are both nealy identical with only a small .66 reducation in model three. Therefore, model 2 is the best choice.
library(visreg)
visreg(m2, "education", scale = "response")
## Warning: Note that you are attempting to plot a 'main effect' in a model that contains an
## interaction. This is potentially misleading; you may wish to consider using the 'by'
## argument.
## Conditions used in construction of plot
## sex: Female
From this visreg chart above, we see very clearly that with an increase of education, there is also an increase in the number of words correct in a 10 count vocabulary test. We also see that with the increase of education, there is an increase in passing the test at a .6 and above.
visreg(m2, "sex", by = "education", scale = "response")
As we can see from the two charts above, in both male and females with the increase of education in years completed there is also an increase of number of words correct in a vocabulary test and the number of people passing the test at a .6. Also males do get a small percentage of words less correct then females, education increases number of words correct among both sexes.
visreg(m2, "sex")
## Warning: Note that you are attempting to plot a 'main effect' in a model that contains an
## interaction. This is potentially misleading; you may wish to consider using the 'by'
## argument.
## Conditions used in construction of plot
## education: 12
visreg(m2, "education")
## Warning: Note that you are attempting to plot a 'main effect' in a model that contains an
## interaction. This is potentially misleading; you may wish to consider using the 'by'
## argument.
## Conditions used in construction of plot
## sex: Female