In The Nature of Vocabulary Acquisition, Robert J. Sternberg of Yale University made a profound statement: > “… vocabulary is probably the best single indicator of a person’s overall level of intelligence. Stated in another way, if one wants a quick and not-too-dirty measure of a person’s psychometrically measured intelligence, and thus has time to give just one brief test of it, vocabulary is generally the best predictor of overall score on a psychometric IQ test. (p. 90)”
How do age group, education level, natality status, and gender influence vocabulary acquisition? Which of these factors better determine the acquisition of a language?
The source data contains analysis of a multifactor observational study, with response given by subject’s score on a vocabulary test, and factors for age group, education level, natality status, gender and year of the survey. Surveys were conducted by the GSS from 1978 to 2016. Other variables were added to measure consistency in sample composition.
Source Data from the General Social Survey (GSS) from the National Opinion Research Center of the University of Chicago. Selected From: http://vincentarelbundock.github.io/Rdatasets/
rm(list=ls())
suppressWarnings(library(magrittr))
suppressWarnings(library(dplyr))
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
theURL <- "https://raw.githubusercontent.com/georg4re/LearningR/master/GSSvocab.csv"
vocab <- read.table(file=theURL, header=TRUE, sep=",")
summary(vocab)
## X year gender nativeBorn
## Min. :1978 Min. :1978 Length:28867 Length:28867
## 1st Qu.:1988 1st Qu.:1988 Class :character Class :character
## Median :1996 Median :1996 Mode :character Mode :character
## Mean :1998 Mean :1997
## 3rd Qu.:2008 3rd Qu.:2008
## Max. :2017 Max. :2016
##
## ageGroup educGroup vocab age
## Length:28867 Length:28867 Min. : 0.000 Min. :18.00
## Class :character Class :character 1st Qu.: 5.000 1st Qu.:32.00
## Mode :character Mode :character Median : 6.000 Median :44.00
## Mean : 5.998 Mean :46.18
## 3rd Qu.: 7.000 3rd Qu.:59.00
## Max. :10.000 Max. :89.00
## NA's :1348 NA's :94
## educ
## Min. : 0.00
## 1st Qu.:12.00
## Median :12.00
## Mean :13.04
## 3rd Qu.:15.00
## Max. :20.00
## NA's :81
Check for missing values
sapply(vocab, function(x) sum(is.na(x)))
## X year gender nativeBorn ageGroup educGroup vocab
## 0 0 0 87 94 81 1348
## age educ
## 94 81
There are 1,348 rows with missing vocabulary information. 94 rows missing age and agegroup information. 81 missing education related info and 87 rows missing nativity based information. We will remove these rows from the data set to ensure sample uniformity.
completeVocab <- vocab[complete.cases(vocab),]
summary(completeVocab)
## X year gender nativeBorn
## Min. :1978 Min. :1978 Length:27360 Length:27360
## 1st Qu.:1989 1st Qu.:1988 Class :character Class :character
## Median :1996 Median :1996 Mode :character Mode :character
## Mean :1998 Mean :1997
## 3rd Qu.:2008 3rd Qu.:2008
## Max. :2017 Max. :2016
## ageGroup educGroup vocab age
## Length:27360 Length:27360 Min. : 0 Min. :18.00
## Class :character Class :character 1st Qu.: 5 1st Qu.:31.00
## Mode :character Mode :character Median : 6 Median :43.00
## Mean : 6 Mean :45.75
## 3rd Qu.: 7 3rd Qu.:59.00
## Max. :10 Max. :89.00
## educ
## Min. : 0.00
## 1st Qu.:12.00
## Median :13.00
## Mean :13.16
## 3rd Qu.:16.00
## Max. :20.00
sapply(completeVocab, function(x) sum(is.na(x)))
## X year gender nativeBorn ageGroup educGroup vocab
## 0 0 0 0 0 0 0
## age educ
## 0 0
There are now no missing values in the data sample, which contains responses collected in surveys from 1978 to 2017. Overall means for education (years) and vocabulary are 13.16 and 6.00, respectively. The overall mean age is 45.75. The summary does not give us the means for gender and nativeBorn.
Below, means for all the factors are calculated by year, then calculated for each year by gender, educational group, ageGroup and natality. Values are merged into the original dataset with descriptive variable names. The sample composition by year will allow us to compare these factors.
#create summary variables
#Create Count and Frequency Variables by Factors
#ResponsesByYear
responsesByYear <- count(completeVocab,year)
names(responsesByYear)[2]<-"ResponsesByYear"
#percentage variable (composition) by gender and year
GenderByYear<-count(completeVocab,year,gender)
names(GenderByYear)[3]<-"GenderByYear"
Temp<-merge(x=GenderByYear,y=responsesByYear,by='year',all.x= TRUE)
Temp$PctGenderYear<-round((Temp$GenderByYear/Temp$ResponsesByYear)*100,3)
#create variables in dataset
completeVocab<-merge(x=completeVocab,y=Temp,by=c('year','gender'),all.x=TRUE)
#Composition Educational Group by Year
EGroupByYear<-count(completeVocab,year, educGroup)
names(EGroupByYear)[3]<-"EducGroupByYear"
#Temp<-merge(x=EGroupByYear, y=responsesByYear,by='year',all.x=TRUE)
EGroupByYear$PctEducGroupYear<-round((EGroupByYear$EducGroupByYear/responsesByYear$ResponsesByYear)*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=EGroupByYear,by=c('year','educGroup'),all.x=TRUE)
#NativeBorn Composition by Year
NativeByYear<-count(completeVocab,year, nativeBorn)
names(NativeByYear)[3]<-"NativeByYear"
NativeByYear$PctNativeYear<-round((NativeByYear$NativeByYear/responsesByYear$ResponsesByYear)*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=NativeByYear,by=c('year','nativeBorn'),all.x=TRUE)
#AgeGroup Composition by Year
AgeGroupByYear<-count(completeVocab,year, ageGroup)
names(AgeGroupByYear)[3]<-"AgeGroupByYear"
AgeGroupByYear$PctAgeGroupYear<-round((AgeGroupByYear$AgeGroupByYear/responsesByYear$ResponsesByYear)*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=AgeGroupByYear,by=c('year','ageGroup'),all.x=TRUE)
#create total count and percentage variables
#Gender Frequency
FreqGender<-count(completeVocab,gender)
names(FreqGender)[2]<-"GenderFrequency"
FreqGender$PctFreqGender<-round((FreqGender$GenderFrequency/sum(FreqGender$GenderFrequency))*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqGender,by="gender",all.x=TRUE)
#Educational Group Frequency
FreqEGroup<-count(completeVocab,educGroup)
names(FreqEGroup)[2]<-"EducGroupFrequency"
FreqEGroup$PctFreqEGroup<-round((FreqEGroup$EducGroupFrequency/sum(FreqEGroup$EducGroupFrequency))*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqEGroup,by="educGroup",all.x=TRUE)
#natality Frequency
FreqNB<-count(completeVocab,nativeBorn)
names(FreqNB)[2]<-"NativeFrequency"
FreqNB$PctFreqNB<-round((FreqNB$NativeFrequency/sum(FreqNB$NativeFrequency))*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqNB,by="nativeBorn",all.x=TRUE)
#AgeGroup Frequency
FreqAgeGroup<-count(completeVocab,ageGroup)
names(FreqAgeGroup)[2]<-"AgeGroupFrequency"
FreqAgeGroup$PctFreqAgeGroup<-round((FreqAgeGroup$AgeGroupFrequency/sum(FreqAgeGroup$AgeGroupFrequency))*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqAgeGroup,by="ageGroup",all.x=TRUE)
#add mean education and score by year for each factor
#gender.
MeanEducGenderYr<-aggregate(completeVocab$educ,by=list(completeVocab$gender,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducGenderYr)[3]<-"MeanEducGenderYr"
names(MeanEducGenderYr)[2]<-"year"
names(MeanEducGenderYr)[1]<-"gender"
#Add to dataset
completeVocab<-merge(x=completeVocab,y=MeanEducGenderYr,by=c('year','gender'),all.x=TRUE)
#natality.
MeanNatalityYr<-aggregate(completeVocab$educ,by=list(completeVocab$nativeBorn,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanNatalityYr)[3]<-"MeanNatalityYr"
names(MeanNatalityYr)[2]<-"year"
names(MeanNatalityYr)[1]<-"nativeBorn"
#Add to dataset
completeVocab<-merge(x=completeVocab,y=MeanNatalityYr,by=c('year','nativeBorn'),all.x=TRUE)
#ageGroup.
MeanAgeGroupYr<-aggregate(completeVocab$educ,by=list(completeVocab$ageGroup,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanAgeGroupYr)[3]<-"MeanAgeGroupYr"
names(MeanAgeGroupYr)[2]<-"year"
names(MeanAgeGroupYr)[1]<-"ageGroup"
completeVocab<-merge(x=completeVocab,y=MeanAgeGroupYr,by=c('year','ageGroup'),all.x=TRUE)
#create variable for mean education by year
MeanEducByYr<-aggregate(completeVocab$educ,by=list(completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducByYr)[2]<-"MeanEducByYr"
names(MeanEducByYr)[1]<-"year"
completeVocab<-merge(x=completeVocab,y=MeanEducByYr,by='year',all.x=TRUE)
#create variables to add mean score by year for each factor
#gender
MeanVocabGenderYr<-aggregate(completeVocab$vocab,by=list(completeVocab$gender,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabGenderYr)[3]<-"MeanVocabGenderYr"
names(MeanVocabGenderYr)[2]<-"year"
names(MeanVocabGenderYr)[1]<-"gender"
completeVocab<-merge(x=completeVocab,y=MeanVocabGenderYr,by=c('year','gender'),all.x=TRUE)
#EducGroup
MeanVocabEGroupYr<-aggregate(completeVocab$vocab,by=list(completeVocab$educGroup,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabEGroupYr)[3]<-"MeanVocabEGroupYr"
names(MeanVocabEGroupYr)[2]<-"year"
names(MeanVocabEGroupYr)[1]<-"educGroup"
completeVocab<-merge(x=completeVocab,y=MeanVocabEGroupYr,by=c('year','educGroup'),all.x=TRUE)
#Natality
MeanVocabNatalityYr<-aggregate(completeVocab$vocab,by=list(completeVocab$nativeBorn,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabNatalityYr)[3]<-"MeanVocabNatalityYr"
names(MeanVocabNatalityYr)[2]<-"year"
names(MeanVocabNatalityYr)[1]<-"nativeBorn"
completeVocab<-merge(x=completeVocab,y=MeanVocabNatalityYr,by=c('year','nativeBorn'),all.x=TRUE)
#AgeGroup
MeanVocabAgeGroupYr<-aggregate(completeVocab$vocab,by=list(completeVocab$ageGroup,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabAgeGroupYr)[3]<-"MeanVocabAgeGroupYr"
names(MeanVocabAgeGroupYr)[2]<-"year"
names(MeanVocabAgeGroupYr)[1]<-"ageGroup"
completeVocab<-merge(x=completeVocab,y=MeanVocabAgeGroupYr,by=c('year','ageGroup'),all.x=TRUE)
#create variable for mean vocabulary score by year
MeanVocabByYr<-aggregate(completeVocab$vocab,by=list(completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabByYr)[2]<-"MeanVocabByYr"
names(MeanVocabByYr)[1]<-"year"
completeVocab<-merge(x=completeVocab,y=MeanVocabByYr,by='year',all.x=TRUE)
#round all values to 3 significant digits.
is.num<-sapply(completeVocab,is.numeric)
completeVocab[is.num]<-lapply(completeVocab[is.num],round,3)
##Graphics First, we will explore the mean education level by year.
library(ggplot2)
library(ggthemes)
bxplot1<-ggplot(completeVocab, aes(y=educ,x=year,fill=factor(year)))+
geom_boxplot()+
theme_bw()+
ggtitle("Education By Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Years in Education")+
theme(legend.position="none")
print(bxplot1)
Mean education years is relatively consistent. We see an increment around 1993 and peaks on 2003 and 2014. Sample size was inconsistent between 1984 and 1993. Since 1993, the sample size appears to be consistent.
Now, let’s explore the mean vocabulary score for the entire sample accross the years
bxplot1<-ggplot(completeVocab, aes(y=vocab,x=year,fill=factor(year)))+
geom_boxplot()+
theme_bw()+
ggtitle("Vocabulary Score By Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Score (Vocabulary)")+
theme(legend.position="none")
print(bxplot1)
The Mean vocabulary appears to be steady throughout the years, with small peaks and valleys. It is interesting that the mean vocabulary is very similar while the years in education have steadily increased.
Let’s take a look at the distribution of the vocabulary scores.
hvoc<-hist(completeVocab$vocab,
main="Vocabulary Scores",
xlab="vocabulary score",
border="beige",
col="beige",
prob=T)
lines(density(completeVocab$vocab, adjust=5), lty="dotted", col="darkblue", lwd=2)
The scores appear to be somewhat evenly distributed.
We will now explore the differences in education level and score by the different factors contained in the data set:
#Education comparison by gender reported by year with mean line
plotGender<-ggplot(completeVocab, aes(x=year, y=MeanEducGenderYr, color=gender))+
geom_line(size=1)+
stat_summary(aes(y=MeanEducByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) +
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Education By Gender by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Years in Education, mean")
print(plotGender)
#Score comparison by gender reported by year with mean line
plotGender<-ggplot(completeVocab, aes(x=year, y=MeanVocabGenderYr, color=gender))+
geom_line(size=1)+
stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) +
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Vocabulary By Gender by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Vocabulary Score, mean")
print(plotGender)
Looking at this two plots, we can extrapolate that females averaged less schooling than males and obtained better overall results in the Vocabulary test.
Let’s now examine the results by Educational Group
#Education comparison by Educational Group reported by year with mean line
plotGroup<-ggplot(completeVocab, aes(x=year, y=MeanVocabEGroupYr, color=educGroup))+
geom_line(size=1)+
stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) +
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Vocabulary by Educational Group by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Vocabulary (Score), mean")
print(plotGroup)
This plot points to a direct correlation between scores and years in school. It also appears to show a decline in overall scores, especially for the top three groups.
#Scores comparison by natality reported by year with mean line
plotNative<-ggplot(completeVocab, aes(x=year, y=MeanVocabNatalityYr, color=nativeBorn))+
geom_line(size=1)+
stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) +
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Vocabulary by Natality by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Vocabulary (Scores), mean")
print(plotNative)
Native borns appear to do significantly better than non-natives.
#Scores comparison by Age Group reported by year with mean line
plotAge<-ggplot(completeVocab, aes(x=year, y=MeanVocabAgeGroupYr, color=ageGroup))+
geom_line(size=1)+
stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) +
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Vocabulary by Age Group by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("Year")+
ylab("Vocabulary (Scores), mean")
print(plotAge)
Although this graph might look a little over the place, we can extract some information from it. It appears the lowest scores belong to the youngest participants and the better scores are consistently obtained by individuals from 40 and up.
Plot Education vs Vocabulary with regression line Gender
#plot education vs vocabulary with regression line
plotGender<-ggplot(completeVocab,aes(jitter(educ,factor=1),jitter(vocab,factor=2),color=gender))+
geom_point(size=.75,alpha=.2)+
stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=y~x)+
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Education & Vocabulary",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("education in years")+
ylab("vocabulary test score")
print(plotGender)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function
AgeGroup
plotAge<-ggplot(completeVocab,aes(jitter(educ,factor=1),jitter(vocab,factor=2),color=ageGroup))+
geom_point(size=.75,alpha=.2)+
stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=y~x)+
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Education & Vocabulary",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("education in years")+
ylab("vocabulary test score")
print(plotAge)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function
Natality
plotNative<-ggplot(completeVocab,aes(jitter(educ,factor=1),jitter(vocab,factor=2),color=nativeBorn))+
geom_point(size=.75,alpha=.2)+
stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=y~x)+
scale_color_wsj("colors6")+
theme_bw()+
ggtitle("Education & Vocabulary",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
xlab("education in years")+
ylab("vocabulary test score")
print(plotNative)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function
As shown in the different graphs, there appears to be a direct relationship between years in education and vocabulary. Not suprising to see a marked difference in vocabulary between non-native and native speakers, what is surprising is that this difference persist regardless of years of education. In fact, non-native speakers with the max years of education only seem to reach the mean of Vocab scores for the overall sample.
Older respondents also did remarkably better than younger ones. With respondents aged 40 and up consistently getting better marks and 18-29 consistently being below the mean. As education level increased, the differences decreased but the older group maintained an advantage.
Another interesting fact is that females appear to get slightly better marks regardless of for years having obtained less education than their male counterparts.
It appears that the factors that most influence vocabulary acquisition appear to be:
Natality. (Marked difference in scores)
Education Level (Direct relationship between vocabulary and education level)
Age Group (Older respondents scored significantly better)
Gender (Females have a slight advantage than the male counterparts)