Is there a significant relationship between cumulative years of education and vocabulary?
Responses were collected in the General Social Survey from 1974 to 2016. The data include gender, education in years, and a vocabulary score ranging from 0 to 10, interpreted as the number of correct responses on a 10-word test. Data were examined by gender and year, and variables were created to check the consistency of sample composition by year.
National Opinion Research Center General Social Survey. GSS Cumulative Datafile 1972-2016, downloaded from http://gss.norc.org/.
Dataset selected from http://vincentarelbundock.github.io/Rdatasets/ and uploaded to a personal GitHub repository: https://raw.githubusercontent.com/sigmasigmaiota/vocab/master/vocab.csv
#clean the workspace.
rm(list=ls())
#getURL is part of RCurl
suppressWarnings(library(RCurl))
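#download csv from my github; stringsAsFactors=TRUE keeps sex as a factor (no longer the default in R >= 4.0)
vocab<-read.csv(text=getURL("https://raw.githubusercontent.com/sigmasigmaiota/vocab/master/vocab.csv"),stringsAsFactors=TRUE)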
#summary
summary(vocab)
##       X.1              X                 year          sex
##  Min.   :    1   Min.   :19740001   Min.   :1974   Female:17148
##  1st Qu.: 7588   1st Qu.:19870112   1st Qu.:1987   Male  :13203
##  Median :15176   Median :19942104   Median :1994
##  Mean   :15176   Mean   :19954597   Mean   :1995
##  3rd Qu.:22764   3rd Qu.:20063676   3rd Qu.:2006
##  Max.   :30351   Max.   :20162866   Max.   :2016
##    education       vocabulary
##  Min.   : 0.00   Min.   : 0.000
##  1st Qu.:12.00   1st Qu.: 5.000
##  Median :12.00   Median : 6.000
##  Mean   :13.03   Mean   : 6.004
##  3rd Qu.:15.00   3rd Qu.: 7.000
##  Max.   :20.00   Max.   :10.000
#remove the first column (X.1), which serves as a redundant row counter.
vocab$X.1<-NULL
#check result
colnames(vocab)
## [1] "X" "year" "sex" "education" "vocabulary"
#load plyr, which provides the count() and join() functions used below.
suppressWarnings(library('plyr'))
#missing values?
sapply(vocab, function(x) sum(is.na(x)))
##          X       year        sex  education vocabulary
##          0          0          0          0          0
There are no missing values in the dataset, which contains responses collected in surveys from 1974 to 2016; the mean survey year is 1995. Overall means for education (years) and vocabulary score are 13.03 and 6.00, respectively. About 56.5% of participants (17,148 of 30,351) identify as female.
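The share of female respondents can be computed directly from the two counts printed in the `summary(vocab)` output. The short sketch below uses only those counts; the object names `sex_counts` and `sex_pct` are illustrative and do not appear in the original analysis.

```r
# Counts copied from the summary(vocab) output above
sex_counts <- c(Female = 17148, Male = 13203)

# Share of each sex in the pooled sample, in percent (one decimal place)
sex_pct <- round(100 * sex_counts / sum(sex_counts), 1)
print(sex_pct)
```

With the full data frame loaded, `round(100 * prop.table(table(vocab$sex)), 1)` should give the same figures.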
Below, means for education, vocabulary and response frequency are calculated by year, then calculated for each year by gender. Values are merged into the original dataset with descriptive variable names. Of special interest is sample composition by year.
#create summary variables
#frequency by year
FreqByYear<-count(vocab,'year')
names(FreqByYear)[2]<-"ResponsesPerYear"
#percentage variable (composition) by sex and year
FreqBySexYear<-count(vocab,c("year","sex"))
names(FreqBySexYear)[3]<-"ResponsesPerSexYear"
Temp<-join(FreqBySexYear,FreqByYear,by='year',type="left")
Temp$SexPerYear.percent<-round((Temp$ResponsesPerSexYear/Temp$ResponsesPerYear)*100,3)
#create variables in dataset
vocab<-join(vocab,Temp,by=c('year','sex'),type="left")
#create total count and percentage variables
FreqSampleSex<-count(vocab,"sex")
FreqSampleSex$FreqSampleSex.percent<-round((FreqSampleSex$freq/sum(FreqSampleSex$freq))*100,3)
names(FreqSampleSex)[2]<-"FreqSampleSex.freq"
#merge with vocab
vocab<-join(vocab,FreqSampleSex,by="sex",type="left")
#add mean education by year for each sex.
MeanEducBySexYr<-aggregate(vocab$education,by=list(vocab$sex,vocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducBySexYr)[3]<-"MeanEducBySexYr"
names(MeanEducBySexYr)[2]<-"year"
names(MeanEducBySexYr)[1]<-"sex"
vocab<-join(vocab,MeanEducBySexYr,by=c('year','sex'),type="left")
#create variable for mean education by year
MeanEducByYr<-aggregate(vocab$education,by=list(vocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducByYr)[2]<-"MeanEducByYr"
names(MeanEducByYr)[1]<-"year"
vocab<-join(vocab,MeanEducByYr,by='year',type="left")
#create variable for mean vocabulary score by year for each sex.
MeanVocabBySexYr<-aggregate(vocab$vocabulary,by=list(vocab$sex,vocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabBySexYr)[3]<-"MeanVocabBySexYr"
names(MeanVocabBySexYr)[2]<-"year"
names(MeanVocabBySexYr)[1]<-"sex"
vocab<-join(vocab,MeanVocabBySexYr,by=c('year','sex'),type="left")
#create variable for mean vocabulary score by year
MeanVocabByYr<-aggregate(vocab$vocabulary,by=list(vocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabByYr)[2]<-"MeanVocabByYr"
names(MeanVocabByYr)[1]<-"year"
vocab<-join(vocab,MeanVocabByYr,by='year',type="left")
#round all numeric values to 3 decimal places.
is.num<-sapply(vocab,is.numeric)
vocab[is.num]<-lapply(vocab[is.num],round,3)
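The aggregate-then-join pattern used above can also be written with base R's `ave()`, which returns a group statistic already aligned to the original rows. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical and are not GSS data); it is an alternative formulation, not the method used in this analysis.

```r
# Hypothetical toy data standing in for the GSS extract (values are invented)
toy <- data.frame(
  year       = c(1974, 1974, 1974, 1976, 1976, 1976, 1976),
  sex        = c("Female", "Male", "Female", "Male", "Male", "Female", "Male"),
  education  = c(12, 16, 10, 12, 14, 12, 18),
  vocabulary = c(6, 7, 5, 6, 8, 6, 9)
)

# ave() computes a statistic per group and repeats it on every row of the group
toy$MeanEducByYr    <- ave(toy$education, toy$year)           # FUN defaults to mean
toy$MeanEducBySexYr <- ave(toy$education, toy$year, toy$sex)

# Group sizes: sum a vector of ones within each group
ones <- rep(1, nrow(toy))
toy$ResponsesPerYear    <- ave(ones, toy$year, FUN = sum)
toy$ResponsesPerSexYear <- ave(ones, toy$year, toy$sex, FUN = sum)

# Composition of each survey year by sex, in percent
toy$SexPerYear.percent <- round(100 * toy$ResponsesPerSexYear / toy$ResponsesPerYear, 3)
print(toy)
```

Because `ave()` keeps the row order and row count of its input, no merge step is needed and there is no risk of a join duplicating or dropping rows.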
The boxplot below shows the distribution of education (in years) for the entire sample, by survey year.
boxplot(education~year,
data=vocab,
main="Education by Year",
xlab="year",
ylab="education in years",
varwidth=T,
col="lightblue",
pars = list(boxwex = 1, staplewex = 1, outwex = 1))
Education by year is relatively consistent, rising slightly over time with recent peaks in 2004 and 2015. Box widths are proportional to the square root of each year's sample size (varwidth); total sample size ebbed from 1988 to 1991.
The boxplot below shows the distribution of vocabulary scores for the entire sample, by survey year.
boxplot(vocabulary~year,
data=vocab,
main="Vocabulary Score by Year",
xlab="year",
ylab="vocabulary",
varwidth=T,
col="lightgreen",
pars = list(boxwex = 1, staplewex = 1, outwex = 1))
Vocabulary scores change only slightly through the years. While education in years increased slightly, vocabulary scores remained essentially flat.
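One way to put numbers on "increased slightly" versus "essentially flat" is to regress each yearly mean on survey year and compare the slopes. The sketch below uses invented yearly means (the `yearly` data frame is hypothetical, not GSS output) purely to show the calculation.

```r
# Hypothetical yearly means (illustrative values, not GSS results)
yearly <- data.frame(
  year      = c(1974, 1984, 1994, 2004, 2014),
  MeanEduc  = c(12.0, 12.5, 13.0, 13.5, 14.0),
  MeanVocab = c(6.0, 6.1, 5.9, 6.0, 6.0)
)

# Slope of each yearly mean on survey year: average change per calendar year
educ_slope  <- coef(lm(MeanEduc ~ year, data = yearly))[["year"]]
vocab_slope <- coef(lm(MeanVocab ~ year, data = yearly))[["year"]]

# Express as change per decade for readability
print(round(10 * c(education = educ_slope, vocabulary = vocab_slope), 3))
```

On the real data, the input table could be built with `aggregate(cbind(education, vocabulary) ~ year, data = vocab, FUN = mean)` before fitting the two models.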
The two boxplots below offer an alternate view of the education and vocabulary boxplots described above.
suppressWarnings(library(ggplot2))
suppressWarnings(library(ggthemes))
bxplot1<-ggplot(vocab, aes(y=education,x=year,fill=factor(year)))+
geom_boxplot()+
theme_bw()+
ggtitle("Education By Year",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("year")+
ylab("education in years")+
theme(legend.position="none")
suppressWarnings(print(bxplot1))
bxplot2<-ggplot(vocab, aes(y=vocabulary,x=year,fill=factor(year)))+
geom_boxplot()+
theme_bw()+
ggtitle("Vocabulary Test Score By Year",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("year")+
ylab("Vocabulary Test Score")+
theme(legend.position="none")
suppressWarnings(print(bxplot2))
The histogram below shows the distribution of vocabulary scores in the sample.
hvoc<-hist(vocab$vocabulary,
main="Vocabulary Scores",
xlab="vocabulary score",
border="blue",
col="blue",
prob=T)
lines(density(vocab$vocabulary, adjust=5), lty="dotted", col="darkgreen", lwd=2)
Vocabulary scores are roughly symmetric around the median of 6, with half of respondents scoring between 5 and 7.
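The shape of a score distribution can be checked numerically as well as visually. The sketch below, on a small invented vector of scores (hypothetical, not GSS data), tabulates relative frequencies and computes a simple moment-based skewness, which is close to zero for a symmetric distribution.

```r
# Hypothetical scores on the 0-10 scale (illustrative only)
scores <- c(2, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 10)

# Relative frequency of every possible score; a uniform spread would give equal shares
freq <- prop.table(table(factor(scores, levels = 0:10)))

# Moment-based skewness: third central moment over the 1.5 power of the second
dev  <- scores - mean(scores)
skew <- mean(dev^3) / mean(dev^2)^1.5

print(round(freq, 3))
print(c(mean = mean(scores), median = median(scores), skewness = skew))
```

Applied to `vocab$vocabulary`, the same calculation would show how far the real distribution departs from symmetry.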
heduc<-hist(vocab$education,
main="Education in Years",
xlab="education in years",
border="blue",
col="lightblue",
prob=T)
lines(density(vocab$education, adjust=5), lty="dotted", col="darkgreen", lwd=2)
A large share of the sample completed exactly 12 years of school, which corresponds to a high school education with no college.
One opportunity the dataset affords is a glimpse into the education disparity between genders over time. The plot below illustrates this disparity by survey year.
#Education disparity reported by year with mean line
plot2<-ggplot(vocab, aes(x=year, y=MeanEducBySexYr, color=sex))+
geom_line(size=.75)+
stat_summary(aes(y=MeanEducByYr,group=1),fun=mean,color="gray",geom="line",group=1) +
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Education Disparity by Year",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("survey year")+
ylab("education in years, mean")
suppressWarnings(print(plot2))
Until 2004, female participants in the sample reported less education than males. After 2004, mean education in years reached parity, with females surpassing males in 2014.
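The gap described here can be read off a table as well as a plot by subtracting the male mean from the female mean in each year. The sketch below uses invented means and years (the `means` data frame is hypothetical) to show the calculation and how to find the first year in which females are ahead.

```r
# Hypothetical mean education by sex and year (invented values and years)
means <- data.frame(
  year = rep(c(2000, 2005, 2010, 2015), each = 2),
  sex  = rep(c("Female", "Male"), times = 4),
  educ = c(13.0, 13.4, 13.5, 13.5, 13.6, 13.6, 13.9, 13.7)
)

# Year-by-sex matrix of means
m <- tapply(means$educ, list(means$year, means$sex), mean)

# Positive gap: females report more education than males in that year
gap <- m[, "Female"] - m[, "Male"]
print(round(gap, 2))

# First year in which the female mean exceeds the male mean
first_ahead <- as.numeric(names(gap)[gap > 0][1])
print(first_ahead)
```

On the real data, `tapply(vocab$education, list(vocab$year, vocab$sex), mean)` would produce the matrix `m` directly.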
The plot below shows disparity in vocabulary test scores by year.
#Vocabulary score disparity by year
plot3<-ggplot(vocab, aes(x=year, y=MeanVocabBySexYr, color=sex))+
geom_line(size=.75)+
stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="gray",geom="line",group=1)+
scale_color_wsj("colors6")+
theme_bw()+
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))+
ggtitle("Vocabulary Test Score Disparity by Year",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("survey year")+
ylab("vocabulary score, mean")
suppressWarnings(print(plot3))
Females achieved higher mean vocabulary scores in most years. Paired with the education levels illustrated earlier, this suggests a more nuanced relationship between cumulative education and vocabulary test scores.
The bar plot below illustrates mean and range of the sample composition by gender.
#Sample Composition
suppressWarnings(library(data.table))
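#rows with the lowest and highest share per sex; names chosen to avoid masking base min() and max()
minPct<-unique(setDT(vocab)[order(SexPerYear.percent)], by = "sex")
maxPct<-unique(setDT(vocab)[order(-SexPerYear.percent)], by = "sex")
vocab.SexFreqByYear<-rbind(minPct,maxPct)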
plot4<-ggplot()+
geom_bar(data=FreqSampleSex,aes(x = sex, y = FreqSampleSex.percent, color = sex, fill = sex),stat="identity",alpha=.2) +
geom_point(data=vocab, aes(x = sex, y = SexPerYear.percent, color = sex, fill = sex))+
geom_text(data=vocab.SexFreqByYear,
aes(x = sex, y = SexPerYear.percent,label = year),position = position_nudge(x = 0.08),
stat="identity",
color = "black",
size = 2.5)+
guides(color = "none", fill = "none") +
theme_bw() +
ggtitle("Sample Composition by Sex",subtitle="GSS Cumulative Datafile 1972-2016")+
labs(
x = "gender",
y = "percent"
)
suppressWarnings(print(plot4))
Surveys distributed in 1994 and 2008 featured the highest percentage of females and males, respectively. The plot below offers another perspective on sample composition.
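The years with the highest share for each sex can also be looked up without plotting. The sketch below is a base R version of that lookup on an invented composition table (`comp`, its years, and its percentages are all hypothetical); the analysis above uses `data.table` for the same purpose.

```r
# Hypothetical composition table: one row per year and sex (invented values)
comp <- data.frame(
  year = rep(c(1990, 2000, 2010), each = 2),
  sex  = rep(c("Female", "Male"), times = 3),
  SexPerYear.percent = c(59, 41, 56, 44, 53, 47)
)

# For each sex, keep the row with the largest share
peak <- do.call(rbind, lapply(split(comp, comp$sex), function(d) {
  d[which.max(d$SexPerYear.percent), ]
}))
print(peak)
```

Swapping `which.max` for `which.min` gives the year with the lowest share for each sex.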
#Sample composition by year
plot5<-ggplot(vocab, aes(x=year, y=SexPerYear.percent, color=sex))+
geom_line(size=.75)+
stat_summary(aes(y=SexPerYear.percent,group=1),fun=mean,color="gray",geom="line",group=1)+
scale_color_wsj("colors6")+
theme_bw()+
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))+
ggtitle("Sample Composition by Year",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("survey year")+
ylab("percent")
suppressWarnings(print(plot5))
The boxplot below shows the distribution of yearly response counts by gender.
#Sample composition by sex (responses per year)
bxplot<-ggplot(vocab, aes(x=sex,y=ResponsesPerSexYear,fill=sex))+
geom_boxplot()+
ggtitle("Sample Composition",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("sex")+
ylab("number of responses")
suppressWarnings(print(bxplot))
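The stacked bar chart below shows the number of responses by year, labeled with each sex's percentage share of that year's sample.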
vocab$RoundSPY.percent<-round(vocab$SexPerYear.percent,0)
#Sample composition by year
plot6<-ggplot(vocab, aes(year))+
geom_bar(aes(fill=sex))+
geom_text(aes(y=ResponsesPerSexYear,label=RoundSPY.percent),size=2,angle=45)+
theme_bw()+
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))+
ggtitle("Sample Composition by Year, with percentage",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("year")+
ylab("number of responses")
suppressWarnings(print(plot6))
#plot education vs vocabulary with regression line
#mean_cl_normal requires the Hmisc package to be installed
plot1<-ggplot(vocab,aes(jitter(education,factor=2),jitter(vocabulary,factor=2),color=sex))+
geom_point(size=.75,alpha=.2)+
stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=y~x)+
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Education & Vocabulary",subtitle="GSS Cumulative Datafile 1972-2016")+
xlab("education in years")+
ylab("vocabulary test score")
suppressWarnings(print(plot1))
In the overall sample, a positive relationship exists between the amount of education reported in the General Social Survey and performance on the vocabulary test. While the relationship is approximately equivalent between genders, females outperformed males on the vocabulary test in most years, even though males consistently reported more years of education in nearly all years. Females made up the larger share of respondents in all years, and a large share of participants stopped formal education after graduating from high school. It appears that the relationship between education and vocabulary is mainly driven by those with 12 to 16 years of formal education. More demographic data are needed to fully understand this relationship.
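The plots describe the direction of the relationship but do not test whether it is statistically significant. One possible check, sketched below on invented data (the `toy` data frame is hypothetical and not drawn from the GSS), is a Pearson correlation test together with a linear model that lets the slope differ by sex. This is an illustration of the approach under those assumptions, not a result from this analysis.

```r
# Hypothetical toy data (invented values, not GSS responses)
toy <- data.frame(
  education  = c(8, 10, 12, 12, 12, 14, 16, 16, 18, 20),
  vocabulary = c(3, 4, 5, 6, 6, 6, 7, 8, 8, 9),
  sex        = rep(c("Female", "Male"), times = 5)
)

# Pearson correlation with a test of the null hypothesis of zero correlation
ct <- cor.test(toy$education, toy$vocabulary)
print(ct$estimate)
print(ct$p.value)

# Simple regression: expected change in score per additional year of education
fit_simple <- lm(vocabulary ~ education, data = toy)
slope <- coef(fit_simple)[["education"]]

# Interaction model: does the education slope differ between sexes?
fit_by_sex <- lm(vocabulary ~ education * sex, data = toy)
print(summary(fit_by_sex)$coefficients)
```

Replacing `toy` with `vocab` would run the same test on the survey data. Because the score is a bounded count from 0 to 10, a rank-based correlation or an ordinal model may be worth considering as a robustness check.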