R Bridge Final Project

Stephen Jones
January 9, 2019

Vocabulary, Education and Gender

Is there a significant relationship between cumulative years of education and vocabulary?

Responses were collected in the General Social Survey from 1974 to 2016; the data include gender, education in years, and a vocabulary score, scaled from 0 to 10 and interpreted as the number of correct responses on a 10-word test. Data were examined by gender and year. Variables were created to assess the consistency of sample composition by year.

Source:

National Opinion Research Center General Social Survey. GSS Cumulative Datafile 1972-2016, downloaded from http://gss.norc.org/.

Dataset selected from http://vincentarelbundock.github.io/Rdatasets/ and uploaded to a personal GitHub repository: https://raw.githubusercontent.com/sigmasigmaiota/vocab/master/vocab.csv

#clean the workspace.
rm(list=ls())
#getURL is part of RCurl
suppressWarnings(library(RCurl))

#download csv from my github
vocab<-read.csv(text=getURL("https://raw.githubusercontent.com/sigmasigmaiota/vocab/master/vocab.csv"))

Data Exploration

#summary
summary(vocab)

##       X.1              X                 year          sex       
##  Min.   :    1   Min.   :19740001   Min.   :1974   Female:17148  
##  1st Qu.: 7588   1st Qu.:19870112   1st Qu.:1987   Male  :13203  
##  Median :15176   Median :19942104   Median :1994                 
##  Mean   :15176   Mean   :19954597   Mean   :1995                 
##  3rd Qu.:22764   3rd Qu.:20063676   3rd Qu.:2006                 
##  Max.   :30351   Max.   :20162866   Max.   :2016                 
##    education       vocabulary    
##  Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:12.00   1st Qu.: 5.000  
##  Median :12.00   Median : 6.000  
##  Mean   :13.03   Mean   : 6.004  
##  3rd Qu.:15.00   3rd Qu.: 7.000  
##  Max.   :20.00   Max.   :10.000

#remove the first column (X.1), which serves as a redundant row counter.
vocab$X.1<-NULL

#check result
colnames(vocab)

## [1] "X"          "year"       "sex"        "education"  "vocabulary"

#load plyr, which provides the count() and join() functions used below.
suppressWarnings(library('plyr'))

#missing values?
sapply(vocab, function(x) sum(is.na(x)))

##          X       year        sex  education vocabulary 
##          0          0          0          0          0

There are no missing values in the dataset, which contains responses collected in surveys from 1974 to 2016, with an overall mean year of 1995. Overall means for education (years) and vocabulary are 13.03 and 6.00, respectively. About 56.5% of participants (17,148 of 30,351) identify as female.
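The gender split described above can be checked from the counts printed by summary(vocab). A minimal sketch, using only the two counts shown in that output; the names counts and pct are illustrative and not part of the original analysis:

```r
# Counts of respondents by sex, copied from the summary(vocab) output above
counts <- c(Female = 17148, Male = 13203)

# Percentage share of each group, rounded to one decimal place
pct <- round(100 * counts / sum(counts), 1)
print(pct)
```

With the downloaded data, the equivalent call would be round(100 * prop.table(table(vocab$sex)), 1).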

Below, means for education, vocabulary and response frequency are calculated by year, then calculated for each year by gender. Values are merged into the original dataset with descriptive variable names. Of special interest is sample composition by year.

Data Wrangling

#create summary variables
#frequency by year
FreqByYear<-count(vocab,'year')
names(FreqByYear)[2]<-"ResponsesPerYear"

#percentage variable (composition) by sex and year
FreqBySexYear<-count(vocab,c("year","sex"))
names(FreqBySexYear)[3]<-"ResponsesPerSexYear"
Temp<-join(FreqBySexYear,FreqByYear,by='year',type="left")
Temp$SexPerYear.percent<-round((Temp$ResponsesPerSexYear/Temp$ResponsesPerYear)*100,3)

#create variables in dataset
vocab<-join(vocab,Temp,by=c('year','sex'),type="left")

#create total count and percentage variables
FreqSampleSex<-count(vocab,"sex")
FreqSampleSex$FreqSampleSex.percent<-round((FreqSampleSex$freq/sum(FreqSampleSex$freq))*100,3)
names(FreqSampleSex)[2]<-"FreqSampleSex.freq"
#merge with vocab
vocab<-join(vocab,FreqSampleSex,by="sex",type="left")

#add mean education by year for each sex.
MeanEducBySexYr<-aggregate(vocab$education,by=list(vocab$sex,vocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducBySexYr)[3]<-"MeanEducBySexYr"
names(MeanEducBySexYr)[2]<-"year"
names(MeanEducBySexYr)[1]<-"sex"
vocab<-join(vocab,MeanEducBySexYr,by=c('year','sex'),type="left")

#create variable for mean education by year
MeanEducByYr<-aggregate(vocab$education,by=list(vocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducByYr)[2]<-"MeanEducByYr"
names(MeanEducByYr)[1]<-"year"
vocab<-join(vocab,MeanEducByYr,by='year',type="left")

#create variable for mean vocabulary score by year for each sex.
MeanVocabBySexYr<-aggregate(vocab$vocabulary,by=list(vocab$sex,vocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabBySexYr)[3]<-"MeanVocabBySexYr"
names(MeanVocabBySexYr)[2]<-"year"
names(MeanVocabBySexYr)[1]<-"sex"
vocab<-join(vocab,MeanVocabBySexYr,by=c('year','sex'),type="left")

#create variable for mean vocabulary score by year
MeanVocabByYr<-aggregate(vocab$vocabulary,by=list(vocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabByYr)[2]<-"MeanVocabByYr"
names(MeanVocabByYr)[1]<-"year"
vocab<-join(vocab,MeanVocabByYr,by='year',type="left")

#round all numeric values to 3 decimal places.
is.num<-sapply(vocab,is.numeric)
vocab[is.num]<-lapply(vocab[is.num],round,3)
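The aggregate-then-join pattern used above attaches a group mean to every row of the dataset. As an aside, base R's ave() can do the same in one step. The sketch below is not the method used in this report; it runs on a small made-up data frame (toy and its values are invented purely for illustration) so the result is easy to verify by hand.

```r
# Toy data, invented for illustration only (not GSS values)
toy <- data.frame(
  year      = c(1974, 1974, 1974, 1976, 1976),
  sex       = c("Female", "Male", "Female", "Male", "Female"),
  education = c(12, 14, 16, 10, 12)
)

# ave() returns a vector as long as its input, holding each row's group mean
toy$MeanEducByYr    <- ave(toy$education, toy$year)           # grouped by year
toy$MeanEducBySexYr <- ave(toy$education, toy$year, toy$sex)  # grouped by year and sex

print(toy)
```

Because ave() keeps the original row order, no join is needed; the trade-off is that the summary table itself (one row per group) is not produced as a separate object.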

Graphics

The boxplot below shows the distribution of education for the entire sample across survey years; box widths are proportional to the number of responses in each year.

boxplot(education~year,
        data=vocab,
        main="Education by Year",
        xlab="year",
        ylab="education in years",
        varwidth=T,
        col="lightblue",
        pars = list(boxwex = 1, staplewex = 1, outwex = 1))

Education by year is relatively consistent, rising slightly over time with recent peaks in 2004 and 2015. Total sample size ebbed from 1988 to 1991.

The boxplot below shows the distribution of vocabulary scores for the entire sample across survey years.

boxplot(vocabulary~year,
        data=vocab,
        main="Vocabulary Score by Year",
        xlab="year",
        ylab="vocabulary",
        varwidth=T,
        col="lightgreen",
        pars = list(boxwex = 1, staplewex = 1, outwex = 1))

Vocabulary scores change only slightly through the years. While education in years increased slightly, vocabulary scores stayed essentially flat.

The two boxplots below offer an alternate view of the education and vocabulary boxplots described above.

suppressWarnings(library(ggplot2))
suppressWarnings(library(ggthemes))

bxplot1<-ggplot(vocab, aes(y=education,x=year,fill=factor(year)))+
         geom_boxplot()+
  theme_bw()+
  ggtitle("Education By Year",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("year")+
  ylab("education in years")+
  theme(legend.position="none")

suppressWarnings(print(bxplot1))

bxplot2<-ggplot(vocab, aes(y=vocabulary,x=year,fill=factor(year)))+
         geom_boxplot()+
  theme_bw()+
  ggtitle("Vocabulary Test Score By Year",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("year")+
  ylab("Vocabulary Test Score")+
  theme(legend.position="none")

suppressWarnings(print(bxplot2))

The histogram below shows the distribution of vocabulary scores in the sample.

hvoc<-hist(vocab$vocabulary,
           main="Vocabulary Scores",
           xlab="vocabulary score",
           border="blue",
           col="blue",
           prob=T)
lines(density(vocab$vocabulary, adjust=5), lty="dotted", col="darkgreen", lwd=2)

Vocabulary scores are roughly symmetrically distributed around the sample mean of about 6.

heduc<-hist(vocab$education,
            main="Education in Years",
            xlab="education in years",
            border="blue",
            col="lightblue",
            prob=T)
lines(density(vocab$education, adjust=5), lty="dotted", col="darkgreen", lwd=2)

A large share of the sample completed exactly 12 years of school, which corresponds to a high school education with no college.

One opportunity the dataset affords is a glimpse into education disparity between genders by year. The plot below illustrates education disparity by year in the sample.

#Education disparity reported by year with mean line
plot2<-ggplot(vocab, aes(x=year, y=MeanEducBySexYr, color=sex))+
  geom_line(size=.75)+
  stat_summary(aes(y=MeanEducByYr,group=1),fun.y=mean,color="gray",geom="line",group=1) + 
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Education Disparity by Year",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("survey year")+
  ylab("education in years, mean")

suppressWarnings(print(plot2))

Until 2004, female participants in the sample reported less education than males; after 2004, mean education in years reached parity, with females surpassing males in 2014.

The plot below shows disparity in vocabulary test scores by year.

#Vocabulary score disparity by year
plot3<-ggplot(vocab, aes(x=year, y=MeanVocabBySexYr, color=sex))+
  geom_line(size=.75)+
  stat_summary(aes(y=MeanVocabByYr,group=1),fun.y=mean,color="gray",geom="line",group=1)+
  scale_color_wsj("colors6")+
  theme_bw()+
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))+
  ggtitle("Vocabulary Test Score Disparity by Year",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("survey year")+
  ylab("vocabulary score, mean")

suppressWarnings(print(plot3))

Females achieved higher mean vocabulary scores in most years. Paired with education levels illustrated earlier, this would indicate a more nuanced relationship between cumulative education and vocabulary test scores.

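The bar plot below illustrates the overall sample composition by gender (bars) together with the range of yearly values (points); the years with the lowest and highest share for each gender are labeled.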

#Sample Composition
suppressWarnings(library(data.table))
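#rows with the lowest and highest yearly share for each sex (names chosen to avoid masking base min/max)
minRows<-unique(setDT(vocab)[order(SexPerYear.percent)], by = "sex")
maxRows<-unique(setDT(vocab)[order(-SexPerYear.percent)], by = "sex")
vocab.SexFreqByYear<-rbind(minRows,maxRows)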

plot4<-ggplot()+
  geom_bar(data=FreqSampleSex,aes(x = sex, y = FreqSampleSex.percent, color = sex, fill = sex),stat="identity",alpha=.2) +
  geom_point(data=vocab, aes(x = sex, y = SexPerYear.percent, color = sex, fill = sex))+
  geom_text(data=vocab.SexFreqByYear,
                  aes(x = sex, y = SexPerYear.percent,label = year),position = position_nudge(x = 0.08), 
                  stat="identity", 
                  color = "black", 
                  size = 2.5)+
  guides(color = "none", fill = "none") +
  theme_bw() +
  ggtitle("Sample Composition by Sex",subtitle="GSS Cumulative Datafile 1972-2016")+
  labs(
    x = "gender",
    y = "percent"
  )

suppressWarnings(print(plot4))

Surveys distributed in 1994 and 2008 featured the highest percentage of females and males, respectively. The plot below offers another perspective on sample composition.

#Sample composition by year
plot5<-ggplot(vocab, aes(x=year, y=SexPerYear.percent, color=sex))+
  geom_line(size=.75)+
  stat_summary(aes(y=SexPerYear.percent,group=1),fun.y=mean,color="gray",geom="line",group=1)+
  scale_color_wsj("colors6")+
  theme_bw()+
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))+
  ggtitle("Sample Composition by Year",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("survey year")+
  ylab("percent")

suppressWarnings(print(plot5))

The plots below illustrate the number or percentage of responses in the overall sample by gender.

#Sample composition by sex: responses per survey year
bxplot<-ggplot(vocab, aes(x=sex,y=ResponsesPerSexYear,fill=sex))+
  geom_boxplot()+
  ggtitle("Sample Composition",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("gender")+
  ylab("number of responses")

suppressWarnings(print(bxplot))

Below, another perspective.

vocab$RoundSPY.percent<-round(vocab$SexPerYear.percent,0)

#Sample composition by year
plot6<-ggplot(vocab, aes(year))+
  geom_bar(aes(fill=sex))+
  geom_text(aes(y=ResponsesPerSexYear,label=RoundSPY.percent),size=2,angle=45)+
  theme_bw()+
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))+
  ggtitle("Sample Composition by Year, with percentage",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("year")+
  ylab("number of responses")

suppressWarnings(print(plot6))

The plot below illustrates the relationship between reported education and vocabulary scores. With the regression line superimposed, the trend is clear: as education increases, vocabulary score tends to increase. This suggests that cumulative education is linked to performance on vocabulary tests.

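#plot education vs vocabulary with regression line
#jitter only the drawn points so the group means and the fit use the recorded values
#mean_cl_normal requires the Hmisc package to be installed
plot1<-ggplot(vocab,aes(education,vocabulary,color=sex))+
  geom_jitter(size=.75,alpha=.2,width=.4,height=.4)+
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm',formula=y~x)+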
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Education & Vocabulary",subtitle="GSS Cumulative Datafile 1972-2016")+
  xlab("education in years")+
  ylab("vocabulary test score")

suppressWarnings(print(plot1))
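The plot addresses the opening question visually; a formal check could pair a correlation test with a simple linear model. The sketch below is one possible approach and is not part of the original analysis: test_relationship is a hypothetical helper, and the toy data frame is invented so the block runs without downloading the survey data.

```r
# Hypothetical helper: summarises the linear association between
# education and vocabulary in a data frame with those two columns.
test_relationship <- function(d) {
  ct  <- cor.test(d$education, d$vocabulary)   # Pearson correlation with a t-test
  fit <- lm(vocabulary ~ education, data = d)  # simple linear regression
  list(
    r       = unname(ct$estimate),             # correlation coefficient
    p.value = ct$p.value,                      # p-value for H0: no correlation
    slope   = unname(coef(fit)["education"])   # score change per year of education
  )
}

# Toy data, invented for illustration only (not GSS values)
toy <- data.frame(
  education  = c(8, 10, 12, 12, 14, 16, 18, 20),
  vocabulary = c(3, 4, 5, 6, 6, 7, 8, 9)
)

res <- test_relationship(toy)
print(res)
```

Applied to the survey data, the call would be test_relationship(vocab). With roughly 30,000 responses even a weak association would reach statistical significance, so the slope and correlation are worth reporting alongside the p-value.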

Conclusions

In the overall sample, a positive relationship exists between the amount of education reported in the General Social Survey and performance on the vocabulary test. While the relationship is approximately equivalent between genders, females outperformed males on the vocabulary test in most years, even though males reported more years of education in most years. Females in all years were more likely to be included in the survey, and a large share of participants stopped formal education after graduating from high school. It appears that the relationship between education and vocabulary is mainly driven by those with 12 to 16 years of formal education. More demographic data are needed to fully understand this relationship.