Factors affecting Vocabulary Acquisition

In The Nature of Vocabulary Acquisition, Robert J. Sternberg of Yale University made a profound statement: > “… vocabulary is probably the best single indicator of a person’s overall level of intelligence. Stated in another way, if one wants a quick and not-too-dirty measure of a person’s psychometrically measured intelligence, and thus has time to give just one brief test of it, vocabulary is generally the best predictor of overall score on a psychometric IQ test. (p. 90)”

How do age group, education level, natality status, and gender influence vocabulary acquisition? Which of these factors better determine the acquisition of a language?

The source data contains analysis of a multifactor observational study, with response given by subject’s score on a vocabulary test, and factors for age group, education level, natality status, gender and year of the survey. Surveys were conducted by the GSS from 1978 to 2016. Other variables were added to measure consistency in sample composition.

Source Data from the General Social Survey (GSS) from the National Opinion Research Center of the University of Chicago. Selected From: http://vincentarelbundock.github.io/Rdatasets/

rm(list=ls())
suppressWarnings(library(magrittr))
suppressWarnings(library(dplyr))
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
theURL <- "https://raw.githubusercontent.com/georg4re/LearningR/master/GSSvocab.csv"
vocab <- read.table(file=theURL, header=TRUE, sep=",")

Exploring the Data

summary(vocab)
##        X             year         gender           nativeBorn       
##  Min.   :1978   Min.   :1978   Length:28867       Length:28867      
##  1st Qu.:1988   1st Qu.:1988   Class :character   Class :character  
##  Median :1996   Median :1996   Mode  :character   Mode  :character  
##  Mean   :1998   Mean   :1997                                        
##  3rd Qu.:2008   3rd Qu.:2008                                        
##  Max.   :2017   Max.   :2016                                        
##                                                                     
##    ageGroup          educGroup             vocab             age       
##  Length:28867       Length:28867       Min.   : 0.000   Min.   :18.00  
##  Class :character   Class :character   1st Qu.: 5.000   1st Qu.:32.00  
##  Mode  :character   Mode  :character   Median : 6.000   Median :44.00  
##                                        Mean   : 5.998   Mean   :46.18  
##                                        3rd Qu.: 7.000   3rd Qu.:59.00  
##                                        Max.   :10.000   Max.   :89.00  
##                                        NA's   :1348     NA's   :94     
##       educ      
##  Min.   : 0.00  
##  1st Qu.:12.00  
##  Median :12.00  
##  Mean   :13.04  
##  3rd Qu.:15.00  
##  Max.   :20.00  
##  NA's   :81

Check for missing values

sapply(vocab, function(x) sum(is.na(x)))
##          X       year     gender nativeBorn   ageGroup  educGroup      vocab 
##          0          0          0         87         94         81       1348 
##        age       educ 
##         94         81

There are 1,348 rows with missing vocabulary information. 94 rows missing age and agegroup information. 81 missing education related info and 87 rows missing nativity based information. We will remove these rows from the data set to ensure sample uniformity.

completeVocab <- vocab[complete.cases(vocab),]
summary(completeVocab)
##        X             year         gender           nativeBorn       
##  Min.   :1978   Min.   :1978   Length:27360       Length:27360      
##  1st Qu.:1989   1st Qu.:1988   Class :character   Class :character  
##  Median :1996   Median :1996   Mode  :character   Mode  :character  
##  Mean   :1998   Mean   :1997                                        
##  3rd Qu.:2008   3rd Qu.:2008                                        
##  Max.   :2017   Max.   :2016                                        
##    ageGroup          educGroup             vocab         age       
##  Length:27360       Length:27360       Min.   : 0   Min.   :18.00  
##  Class :character   Class :character   1st Qu.: 5   1st Qu.:31.00  
##  Mode  :character   Mode  :character   Median : 6   Median :43.00  
##                                        Mean   : 6   Mean   :45.75  
##                                        3rd Qu.: 7   3rd Qu.:59.00  
##                                        Max.   :10   Max.   :89.00  
##       educ      
##  Min.   : 0.00  
##  1st Qu.:12.00  
##  Median :13.00  
##  Mean   :13.16  
##  3rd Qu.:16.00  
##  Max.   :20.00
sapply(completeVocab, function(x) sum(is.na(x)))
##          X       year     gender nativeBorn   ageGroup  educGroup      vocab 
##          0          0          0          0          0          0          0 
##        age       educ 
##          0          0

There are now no missing values in the data sample, which contains responses collected in surveys from 1978 to 2017. Overall means for education (years) and vocabulary are 13.16 and 6.00, respectively. The overall mean age is 45.75. The summary does not give us the means for gender and nativeBorn.

Data Preparation

Below, means for all the factors are calculated by year, then calculated for each year by gender, educational group, ageGroup and natality. Values are merged into the original dataset with descriptive variable names. The sample composition by year will allow us to compare these factors.

#create summary variables

#Create Count and Frequency Variables by Factors
#ResponsesByYear
responsesByYear <- count(completeVocab,year)
names(responsesByYear)[2]<-"ResponsesByYear"

#percentage variable (composition) by gender and year
GenderByYear<-count(completeVocab,year,gender)
names(GenderByYear)[3]<-"GenderByYear"
Temp<-merge(x=GenderByYear,y=responsesByYear,by='year',all.x= TRUE)
Temp$PctGenderYear<-round((Temp$GenderByYear/Temp$ResponsesByYear)*100,3)
 
#create variables in dataset
completeVocab<-merge(x=completeVocab,y=Temp,by=c('year','gender'),all.x=TRUE)
 
#Composition Educational Group by Year
EGroupByYear<-count(completeVocab,year, educGroup)
names(EGroupByYear)[3]<-"EducGroupByYear"
#Temp<-merge(x=EGroupByYear, y=responsesByYear,by='year',all.x=TRUE)
EGroupByYear$PctEducGroupYear<-round((EGroupByYear$EducGroupByYear/responsesByYear$ResponsesByYear)*100,3)
 
#Add to dataset
completeVocab<-merge(x=completeVocab,y=EGroupByYear,by=c('year','educGroup'),all.x=TRUE)

#NativeBorn Composition by Year
NativeByYear<-count(completeVocab,year, nativeBorn)
names(NativeByYear)[3]<-"NativeByYear"
NativeByYear$PctNativeYear<-round((NativeByYear$NativeByYear/responsesByYear$ResponsesByYear)*100,3)

#Add to dataset
completeVocab<-merge(x=completeVocab,y=NativeByYear,by=c('year','nativeBorn'),all.x=TRUE)

#AgeGroup Composition by Year
AgeGroupByYear<-count(completeVocab,year, ageGroup)
names(AgeGroupByYear)[3]<-"AgeGroupByYear"
AgeGroupByYear$PctAgeGroupYear<-round((AgeGroupByYear$AgeGroupByYear/responsesByYear$ResponsesByYear)*100,3)

#Add to dataset
completeVocab<-merge(x=completeVocab,y=AgeGroupByYear,by=c('year','ageGroup'),all.x=TRUE)

#create total count and percentage variables
#Gender Frequency
FreqGender<-count(completeVocab,gender)
names(FreqGender)[2]<-"GenderFrequency"
FreqGender$PctFreqGender<-round((FreqGender$GenderFrequency/sum(FreqGender$GenderFrequency))*100,3)
#Add to dataset 
completeVocab<-merge(x=completeVocab,y=FreqGender,by="gender",all.x=TRUE)

#Educational Group Frequency
FreqEGroup<-count(completeVocab,educGroup)
names(FreqEGroup)[2]<-"EducGroupFrequency"
FreqEGroup$PctFreqEGroup<-round((FreqEGroup$EducGroupFrequency/sum(FreqEGroup$EducGroupFrequency))*100,3)

#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqEGroup,by="educGroup",all.x=TRUE)

#natality Frequency
FreqNB<-count(completeVocab,nativeBorn)
names(FreqNB)[2]<-"NativeFrequency"
FreqNB$PctFreqNB<-round((FreqNB$NativeFrequency/sum(FreqNB$NativeFrequency))*100,3)

#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqNB,by="nativeBorn",all.x=TRUE)

#AgeGroup Frequency
FreqAgeGroup<-count(completeVocab,ageGroup)
names(FreqAgeGroup)[2]<-"AgeGroupFrequency"
FreqAgeGroup$PctFreqAgeGroup<-round((FreqAgeGroup$AgeGroupFrequency/sum(FreqAgeGroup$AgeGroupFrequency))*100,3)
#Add to dataset
completeVocab<-merge(x=completeVocab,y=FreqAgeGroup,by="ageGroup",all.x=TRUE)

#add mean education and score by year for each factor
#gender.
MeanEducGenderYr<-aggregate(completeVocab$educ,by=list(completeVocab$gender,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducGenderYr)[3]<-"MeanEducGenderYr"
names(MeanEducGenderYr)[2]<-"year"
names(MeanEducGenderYr)[1]<-"gender"
#Add to dataset
completeVocab<-merge(x=completeVocab,y=MeanEducGenderYr,by=c('year','gender'),all.x=TRUE)

#natality.
MeanNatalityYr<-aggregate(completeVocab$educ,by=list(completeVocab$nativeBorn,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanNatalityYr)[3]<-"MeanNatalityYr"
names(MeanNatalityYr)[2]<-"year"
names(MeanNatalityYr)[1]<-"nativeBorn"
#Add to dataset
completeVocab<-merge(x=completeVocab,y=MeanNatalityYr,by=c('year','nativeBorn'),all.x=TRUE)

#ageGroup.
MeanAgeGroupYr<-aggregate(completeVocab$educ,by=list(completeVocab$ageGroup,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanAgeGroupYr)[3]<-"MeanAgeGroupYr"
names(MeanAgeGroupYr)[2]<-"year"
names(MeanAgeGroupYr)[1]<-"ageGroup"
completeVocab<-merge(x=completeVocab,y=MeanAgeGroupYr,by=c('year','ageGroup'),all.x=TRUE)

#create variable for mean education by year
MeanEducByYr<-aggregate(completeVocab$educ,by=list(completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanEducByYr)[2]<-"MeanEducByYr"
names(MeanEducByYr)[1]<-"year"
completeVocab<-merge(x=completeVocab,y=MeanEducByYr,by='year',all.x=TRUE)

#create variables to add mean score by year for each factor
#gender
MeanVocabGenderYr<-aggregate(completeVocab$vocab,by=list(completeVocab$gender,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabGenderYr)[3]<-"MeanVocabGenderYr"
names(MeanVocabGenderYr)[2]<-"year"
names(MeanVocabGenderYr)[1]<-"gender"
completeVocab<-merge(x=completeVocab,y=MeanVocabGenderYr,by=c('year','gender'),all.x=TRUE)

#EducGroup
MeanVocabEGroupYr<-aggregate(completeVocab$vocab,by=list(completeVocab$educGroup,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabEGroupYr)[3]<-"MeanVocabEGroupYr"
names(MeanVocabEGroupYr)[2]<-"year"
names(MeanVocabEGroupYr)[1]<-"educGroup"
completeVocab<-merge(x=completeVocab,y=MeanVocabEGroupYr,by=c('year','educGroup'),all.x=TRUE)

#Natality
MeanVocabNatalityYr<-aggregate(completeVocab$vocab,by=list(completeVocab$nativeBorn,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabNatalityYr)[3]<-"MeanVocabNatalityYr"
names(MeanVocabNatalityYr)[2]<-"year"
names(MeanVocabNatalityYr)[1]<-"nativeBorn"
completeVocab<-merge(x=completeVocab,y=MeanVocabNatalityYr,by=c('year','nativeBorn'),all.x=TRUE)

#AgeGroup
MeanVocabAgeGroupYr<-aggregate(completeVocab$vocab,by=list(completeVocab$ageGroup,completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabAgeGroupYr)[3]<-"MeanVocabAgeGroupYr"
names(MeanVocabAgeGroupYr)[2]<-"year"
names(MeanVocabAgeGroupYr)[1]<-"ageGroup"
completeVocab<-merge(x=completeVocab,y=MeanVocabAgeGroupYr,by=c('year','ageGroup'),all.x=TRUE)

#create variable for mean vocabulary score by year
MeanVocabByYr<-aggregate(completeVocab$vocab,by=list(completeVocab$year),FUN=mean,na.rm=TRUE)
names(MeanVocabByYr)[2]<-"MeanVocabByYr"
names(MeanVocabByYr)[1]<-"year"
completeVocab<-merge(x=completeVocab,y=MeanVocabByYr,by='year',all.x=TRUE)

#round all values to 3 significant digits.
is.num<-sapply(completeVocab,is.numeric)
completeVocab[is.num]<-lapply(completeVocab[is.num],round,3)

##Graphics First, we will explore the mean education level by year.

library(ggplot2)
library(ggthemes)

bxplot1<-ggplot(completeVocab, aes(y=educ,x=year,fill=factor(year)))+
         geom_boxplot()+
  theme_bw()+
  ggtitle("Education By Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Years in Education")+
  theme(legend.position="none")

print(bxplot1)

Mean education years is relatively consistent. We see an increment around 1993 and peaks on 2003 and 2014. Sample size was inconsistent between 1984 and 1993. Since 1993, the sample size appears to be consistent.

Now, let’s explore the mean vocabulary score for the entire sample accross the years

bxplot1<-ggplot(completeVocab, aes(y=vocab,x=year,fill=factor(year)))+
         geom_boxplot()+
  theme_bw()+
  ggtitle("Vocabulary Score By Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Score (Vocabulary)")+
  theme(legend.position="none")

print(bxplot1)

The Mean vocabulary appears to be steady throughout the years, with small peaks and valleys. It is interesting that the mean vocabulary is very similar while the years in education have steadily increased.

Let’s take a look at the distribution of the vocabulary scores.

hvoc<-hist(completeVocab$vocab,
           main="Vocabulary Scores",
           xlab="vocabulary score",
           border="beige",
           col="beige",
           prob=T)
lines(density(completeVocab$vocab, adjust=5), lty="dotted", col="darkblue", lwd=2) 

The scores appear to be somewhat evenly distributed.

We will now explore the differences in education level and score by the different factors contained in the data set:

Factor Comparison

#Education comparison by gender reported by year with mean line
plotGender<-ggplot(completeVocab, aes(x=year, y=MeanEducGenderYr, color=gender))+
  geom_line(size=1)+
  stat_summary(aes(y=MeanEducByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) + 
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Education By Gender by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Years in Education, mean")

print(plotGender)

#Score comparison by gender reported by year with mean line
plotGender<-ggplot(completeVocab, aes(x=year, y=MeanVocabGenderYr, color=gender))+
  geom_line(size=1)+
  stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) + 
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Vocabulary By Gender by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Vocabulary Score, mean")

print(plotGender)

Looking at this two plots, we can extrapolate that females averaged less schooling than males and obtained better overall results in the Vocabulary test.

Educational Groups

Let’s now examine the results by Educational Group

#Education comparison by Educational Group reported by year with mean line
plotGroup<-ggplot(completeVocab, aes(x=year, y=MeanVocabEGroupYr, color=educGroup))+
  geom_line(size=1)+
  stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) + 
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Vocabulary by Educational Group by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Vocabulary (Score), mean")

print(plotGroup)

This plot points to a direct correlation between scores and years in school. It also appears to show a decline in overall scores, especially for the top three groups.

#Scores comparison by natality reported by year with mean line
plotNative<-ggplot(completeVocab, aes(x=year, y=MeanVocabNatalityYr, color=nativeBorn))+
  geom_line(size=1)+
  stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) + 
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Vocabulary by Natality by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Vocabulary (Scores), mean")

print(plotNative)

Native borns appear to do significantly better than non-natives.

#Scores comparison by Age Group reported by year with mean line
plotAge<-ggplot(completeVocab, aes(x=year, y=MeanVocabAgeGroupYr, color=ageGroup))+
  geom_line(size=1)+
  stat_summary(aes(y=MeanVocabByYr,group=1),fun=mean,color="darkgray",geom="line",group=1) + 
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Vocabulary by Age Group by Year",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("Year")+
  ylab("Vocabulary (Scores), mean")

print(plotAge)

Although this graph might look a little over the place, we can extract some information from it. It appears the lowest scores belong to the youngest participants and the better scores are consistently obtained by individuals from 40 and up.

Plot Education vs Vocabulary with regression line Gender

#plot education vs vocabulary with regression line
plotGender<-ggplot(completeVocab,aes(jitter(educ,factor=1),jitter(vocab,factor=2),color=gender))+
  geom_point(size=.75,alpha=.2)+
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm',formula=y~x)+
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Education & Vocabulary",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("education in years")+
  ylab("vocabulary test score")

print(plotGender)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function

AgeGroup

plotAge<-ggplot(completeVocab,aes(jitter(educ,factor=1),jitter(vocab,factor=2),color=ageGroup))+
  geom_point(size=.75,alpha=.2)+
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm',formula=y~x)+
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Education & Vocabulary",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("education in years")+
  ylab("vocabulary test score")

print(plotAge)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function

Natality

plotNative<-ggplot(completeVocab,aes(jitter(educ,factor=1),jitter(vocab,factor=2),color=nativeBorn))+
  geom_point(size=.75,alpha=.2)+
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm',formula=y~x)+
  scale_color_wsj("colors6")+
  theme_bw()+
  ggtitle("Education & Vocabulary",subtitle="National Opinion Research Center of the University of Chicago. 1978-2017")+
  xlab("education in years")+
  ylab("vocabulary test score")

print(plotNative)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function

Conclusions

As shown in the different graphs, there appears to be a direct relationship between years in education and vocabulary. Not suprising to see a marked difference in vocabulary between non-native and native speakers, what is surprising is that this difference persist regardless of years of education. In fact, non-native speakers with the max years of education only seem to reach the mean of Vocab scores for the overall sample.

Older respondents also did remarkably better than younger ones. With respondents aged 40 and up consistently getting better marks and 18-29 consistently being below the mean. As education level increased, the differences decreased but the older group maintained an advantage.

Another interesting fact is that females appear to get slightly better marks regardless of for years having obtained less education than their male counterparts.

It appears that the factors that most influence vocabulary acquisition appear to be:

  1. Natality. (Marked difference in scores)

  2. Education Level (Direct relationship between vocabulary and education level)

  3. Age Group (Older respondents scored significantly better)

  4. Gender (Females have a slight advantage than the male counterparts)