Exploring the Data

library(Zelig)
## Loading required package: survival
library(car)
library(carData)
## 
## Attaching package: 'carData'
## The following objects are masked from 'package:car':
## 
##     Guyer, UN, Vocab
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(pander)
data(Vocab)

str(Vocab)
## 'data.frame':    30351 obs. of  4 variables:
##  $ year      : num  1974 1974 1974 1974 1974 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
##  $ education : num  14 16 10 10 12 16 17 10 12 11 ...
##  $ vocabulary: num  9 9 9 5 8 8 9 5 3 5 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:32115] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "names")= chr [1:32115] "19720001" "19720002" "19720003" "19720004" ...

My Data

head(Vocab)

My data is looking at how number of vocabulary words correct in a 10-word test is effected based on sex and number of years of education completed. I wanted to see whether or not male or females got more words correct then females, as well as if the increase of education led to an increase of number of words correct.

Recoding Vocab

vocabulary2 <- ifelse(Vocab$vocabulary >= 6, 1, 0)

head(vocabulary2)
## [1] 1 1 1 0 1 1

In this model, vocabulary has been recoded so that a 6 or 60% is considering to be passing and anything lower than a 6 is considered to be failing. in a 10 word vocabulary model. 1=passing, 0=failing

Obtaining Probabilities

logit.vote <- glm(vocabulary2 ~ sex, family = "binomial", data = Vocab)
coef(logit.vote)
## (Intercept)     sexMale 
##   0.5166771  -0.1095706

slope is -.109

dmeo_df <- Vocab %>%
  group_by(sex) %>%
  summarise(py1 = mean(vocabulary2)) %>%
  mutate(py0 = 1 - py1) %>%
  pandoc.table()
## 
## --------------------------
##   sex      py1      py0   
## -------- -------- --------
##  Female   0.6151   0.3849 
## 
##   Male    0.6151   0.3849 
## --------------------------

*Odds: Pw/(1 - Pw) = .6151/(1 - .6151) = 1.598077 Po/(1 - Po) = .6151/(1 - .6151) = 1.598077

Odds Ratio: Pw/(1 - Pw) / Po/(1 - Po) = 1.598077/1.598077 = 1

Logs Odd Ratio log(Pw/(1 - Pw) / Po/(1 - Po)) = log(1.598077/1.598077) = 0*

logit.vote <- glm(vocabulary2 ~ education, family = "binomial", data = Vocab)
coef(logit.vote)
## (Intercept)   education 
##  -3.5064320   0.3109145

slope is .3109

Model 1 Do men get more words correct then women?

m0 <- glm(vocabulary2 ~ sex, family = binomial, data = Vocab)
summary(m0)
## 
## Call:
## glm(formula = vocabulary2 ~ sex, family = binomial, data = Vocab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4032  -1.3545   0.9673   1.0101   1.0101  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.51668    0.01579   32.73  < 2e-16 ***
## sexMale     -0.10957    0.02377   -4.61 4.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40453  on 30350  degrees of freedom
## Residual deviance: 40432  on 30349  degrees of freedom
## AIC: 40436
## 
## Number of Fisher Scoring iterations: 4

Males are more likely to not score as high on a vocab test.

Model 2 Sex and Education

m1 <- glm(vocabulary2 ~ sex + education, family = binomial, data = Vocab)
summary(m1)
## 
## Call:
## glm(formula = vocabulary2 ~ sex + education, family = binomial, 
##     data = Vocab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3962  -1.1760   0.6173   1.0043   2.7096  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.44479    0.06694 -51.463  < 2e-16 ***
## sexMale     -0.20051    0.02586  -7.753 8.99e-15 ***
## education    0.31286    0.00519  60.284  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40453  on 30350  degrees of freedom
## Residual deviance: 35597  on 30348  degrees of freedom
## AIC: 35603
## 
## Number of Fisher Scoring iterations: 4

Males are more likely not to score as high as said in the top model, and as education increase they are more likely to get a high percentage correct

Model 3 Interaction between Sex and Education

m2 <- glm(vocabulary2 ~ sex*education, family = binomial, data = Vocab)
summary(m2)
## 
## Call:
## glm(formula = vocabulary2 ~ sex * education, family = binomial, 
##     data = Vocab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3844  -1.1785   0.6213   1.0039   2.7304  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -3.395523   0.090041 -37.711   <2e-16 ***
## sexMale           -0.307564   0.134173  -2.292   0.0219 *  
## education          0.308913   0.007087  43.591   <2e-16 ***
## sexMale:education  0.008465   0.010410   0.813   0.4161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40453  on 30350  degrees of freedom
## Residual deviance: 35597  on 30347  degrees of freedom
## AIC: 35605
## 
## Number of Fisher Scoring iterations: 4

There are a few values above that we can see are statistically significant, meaning that their p values fall below .05.As we can see from the data above, the intercept, sexMale, and education are all three statistically significant. For males logs odd ratio of getting the vocabulary words correct is is -0.3. With education logs odds ratio of having high score on the vocabulary test is 0.3. For men with education it is .008.

Liklihood ratio test

lmtest::lrtest(m0, m1, m2)
anova(m0, m1, m2, test = "Chisq")

HTML AIC/BIC Chart

library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
## 
## Attaching package: 'texreg'
## The following object is masked from 'package:magrittr':
## 
##     extract
htmlreg(list(m0, m1, m2))
Statistical models
Model 1 Model 2 Model 3
(Intercept) 0.52*** -3.44*** -3.40***
(0.02) (0.07) (0.09)
sexMale -0.11*** -0.20*** -0.31*
(0.02) (0.03) (0.13)
education 0.31*** 0.31***
(0.01) (0.01)
sexMale:education 0.01
(0.01)
AIC 40436.15 35603.21 35604.55
BIC 40452.79 35628.17 35637.83
Log Likelihood -20216.07 -17798.60 -17798.27
Deviance 40432.15 35597.21 35596.55
Num. obs. 30351 30351 30351
p < 0.001, p < 0.01, p < 0.05

There is good justification in choosing model 2 as the best fit model for this data set. The interaction model in model 3 is not significant and does not present any more information than model 2. The AIC and BIC in model is the lowest as well. When looking at the liklihood in models 2 and 3, they are both nealy identical with only a small .66 reducation in model three. Therefore, model 2 is the best choice.

VISREG Chart for Education

library(visreg)
visreg(m2, "education", scale = "response")
## Warning:   Note that you are attempting to plot a 'main effect' in a model that contains an
##   interaction.  This is potentially misleading; you may wish to consider using the 'by'
##   argument.
## Conditions used in construction of plot
## sex: Female

From this visreg chart above, we see very clearly that with an increase of education, there is also an increase in the number of words correct in a 10 count vocabulary test. We also see that with the increase of education, there is an increase in passing the test at a .6 and above.

VISREG Chart for Sex and Education

visreg(m2, "sex", by = "education", scale = "response")

As we can see from the two charts above, in both male and females with the increase of education in years completed there is also an increase of number of words correct in a vocabulary test and the number of people passing the test at a .6. Also males do get a small percentage of words less correct then females, education increases number of words correct among both sexes.

visreg(m2, "sex")
## Warning:   Note that you are attempting to plot a 'main effect' in a model that contains an
##   interaction.  This is potentially misleading; you may wish to consider using the 'by'
##   argument.
## Conditions used in construction of plot
## education: 12

visreg(m2, "education")
## Warning:   Note that you are attempting to plot a 'main effect' in a model that contains an
##   interaction.  This is potentially misleading; you may wish to consider using the 'by'
##   argument.
## Conditions used in construction of plot
## sex: Female