Abstract

This report summarizes the vocabulary test scores of over 20,000 students and examines how those scores vary with the year the data were recorded, the gender of the students, and the years of education each student completed. We investigate three hypotheses. The first is that students with more years of education receive higher vocabulary test scores. The second is that vocabulary test scores have not improved in a clear upward trend over the past 30 years. The third is that females have a more developed vocabulary regardless of other factors such as education. After analyzing the trends in vocabulary scores over the span of 30 years, we reached a few key conclusions about these hypotheses. Although education generally had a positive and significant association with vocabulary scores, the gender of the student could outweigh this effect: the males in this sample had, on average, more years of education, yet females had the higher average vocabulary score. We found no clear evidence that students' vocabulary has improved over the years. Implications of these findings, along with the methods used, are discussed further.

Introduction

Vocabulary tests have long been a standardized means of assessing the development of word acquisition in individuals. Several factors can play a role in how fast this development occurs. In this study, we focus on three particular factors: gender, education level, and the year the test was taken.

We hypothesize that two of these factors, education and gender, each play a significant role in vocabulary scores. It seems natural that one's vocabulary improves as one's education progresses: students are exposed to more words and become better able to understand and retain more advanced words. In principle this applies to all students regardless of gender. However, Turcotte (2011) reports that as women have progressed in education over the past few decades, they have generally done better than men. This suggests that the ability to grasp concepts may be stronger in females. In fact, Xia (2013) concluded from her study that women pay more attention to language than men: compared with men, women make fewer pronunciation errors, use better terminology, are more considerate of different languages, and so forth.

In essence, vocabulary tests assess how well individuals grasp certain words. Given that studies point towards better grasping abilities in females and in more highly educated individuals, we expect these two groups to have higher vocabulary scores as well.

The one factor we hypothesize will not have an effect on vocabulary scores is the year in which the test was taken. According to Barshay (2013), students' test scores have not progressed since the 1970s. Scores have tended to fluctuate slightly, but overall there has been very little improvement.

Methods

The first step we took was to summarize the mean vocabulary scores grouped by each individual variable: education, sex and the year in which they were surveyed. Then we graphed the trends of the means for education and the means for the year surveyed. We compared these graphed trends to the graphs produced from the raw data given by each corresponding variable.

To gain a general overview of how the factors affect vocabulary scores, we ran a multiple regression. We then ran a separate linear regression of vocabulary score on each individual variable. Finally, we continued our investigation of gender by conducting an ANOVA and a two-sample t-test.

Results

Exploratory analysis

## Parsed with column specification:
## cols(
##   year = col_integer(),
##   sex = col_character(),
##   education = col_integer(),
##   vocabulary = col_integer()
## )
  1. Gather summary of means for each variable and arrange by mean:
sum_edu = vocab %>% group_by(education) %>%
  summarize(mean=mean(vocabulary),md=median(vocabulary)) %>% arrange(mean) 
sum_edu
## # A tibble: 21 x 3
##    education     mean    md
##        <int>    <dbl> <dbl>
##  1         1 2.333333     2
##  2         4 2.857143     3
##  3         3 3.000000     3
##  4         5 3.212389     3
##  5         0 3.387097     3
##  6         6 3.650655     4
##  7         2 3.750000     4
##  8         7 4.069401     4
##  9         9 4.402797     4
## 10         8 4.506849     5
## # ... with 11 more rows

Because the table above is arranged by mean score rather than by years of education, the trend is not obvious at first glance, but there does seem to be a steady increase in scores once years of education pass 10. The association between these two variables has to be investigated further through linear regression. For now, the summarized data give only a general indication of whether there is any form of correlation between the two.
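As a minimal sketch, the same summary can be arranged by years of education instead of by mean. The code below re-enters only the ten rows printed above; the eleven rows hidden by the tibble printout are not available here and are left out. The names `edu_means`, `edu_sorted` and `edu_rho` are illustrative and not part of the original analysis.

```r
# The ten rows printed in the summary above (values copied from the output;
# the eleven rows hidden by the tibble printout are deliberately omitted).
edu_means <- data.frame(
  education = c(1, 4, 3, 5, 0, 6, 2, 7, 9, 8),
  mean = c(2.333333, 2.857143, 3.000000, 3.212389, 3.387097,
           3.650655, 3.750000, 4.069401, 4.402797, 4.506849)
)

# Sort by years of education rather than by mean score.
edu_sorted <- edu_means[order(edu_means$education), ]
print(edu_sorted)

# Rank correlation between education and mean score, for these rows only.
edu_rho <- cor(edu_sorted$education, edu_sorted$mean, method = "spearman")
print(edu_rho)
```

With the full data, the equivalent change would be `arrange(education)` in place of `arrange(mean)`. The group sizes are not shown in the summary, so means for sparsely populated education levels may be noisy.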

sum_sex = vocab %>% group_by(sex) %>%
  summarize(mean=mean(vocabulary),md=median(vocabulary)) 
sum_sex
## # A tibble: 2 x 3
##      sex     mean    md
##    <chr>    <dbl> <dbl>
## 1 Female 6.032732     6
## 2   Male 5.947888     6

The first thing that stands out in this summary is that the medians for both sexes are identical. The means, however, differ by 0.084844 (6.032732 for females versus 5.947888 for males). We must run further tests to determine whether this slight difference in means is significant.

sum_yr = vocab %>% group_by(year) %>%
  summarize(mean=mean(vocabulary),md=median(vocabulary)) %>% arrange(mean)
sum_yr
## # A tibble: 16 x 3
##     year     mean    md
##    <int>    <dbl> <dbl>
##  1  1987 5.694461     6
##  2  1982 5.741149     6
##  3  1988 5.766304     6
##  4  1989 5.940083     6
##  5  1978 5.964960     6
##  6  1984 5.994294     6
##  7  2000 6.011442     6
##  8  1974 6.024205     6
##  9  1993 6.033564     6
## 10  1996 6.039657     6
## 11  1976 6.044630     6
## 12  1991 6.090531     6
## 13  1998 6.131437     6
## 14  1990 6.138498     6
## 15  1994 6.167391     6
## 16  2004 6.210709     6

Based on the summary alone, we see no clear trend in the relation between vocabulary scores and the year the participants were surveyed. However, we must investigate further in order to conclude whether any correlation exists between the two.
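The yearly means are easier to read in chronological order. The sketch below re-enters the sixteen rows from the table above, sorts them by year, and fits an unweighted line through the yearly means. This is only a rough descriptive check, not the respondent-level regression reported later; `yr_means`, `yr_sorted`, `yr_slope` and `mean_range` are illustrative names.

```r
# Yearly means copied from the summary output above.
yr_means <- data.frame(
  year = c(1987, 1982, 1988, 1989, 1978, 1984, 2000, 1974,
           1993, 1996, 1976, 1991, 1998, 1990, 1994, 2004),
  mean = c(5.694461, 5.741149, 5.766304, 5.940083, 5.964960, 5.994294,
           6.011442, 6.024205, 6.033564, 6.039657, 6.044630, 6.090531,
           6.131437, 6.138498, 6.167391, 6.210709)
)

# Put the rows in chronological order.
yr_sorted <- yr_means[order(yr_means$year), ]
print(yr_sorted)

# Spread between the lowest and highest yearly mean.
mean_range <- diff(range(yr_sorted$mean))

# Slope of an unweighted line through the 16 yearly means
# (each year counts equally, regardless of how many people were surveyed).
yr_slope <- unname(coef(lm(mean ~ year, data = yr_sorted))["year"])
print(c(range = mean_range, slope_per_year = yr_slope))
```

The spread between the lowest and highest yearly mean is about half a point on the 10-point scale, which helps put any year-to-year change in perspective.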

  2. Plot the means for the group education and for the group year
ggplot(sum_edu, aes(x=education, y=mean))+geom_point() 

Looking at the graph alone, there is a clear upward linear trend between education and the mean vocabulary score. This gives us a general idea that individuals with more years of education scored higher, on average, on the vocabulary test. Again, this is for general overview purposes and serves as a guideline for what to look for when conducting other tests.

ggplot(sum_yr, aes(x=year, y=mean))+geom_point()

Notice when we plot out the average scores based on each year we see no clear trend. The only fact that can be drawn from this is that aside from the year 2000, scores did tend to be higher on average from 1990 to 2004 compared to the earlier years.

Note that the only two graphs we need to run are for education and years; it is impractical to graph a plot for gender means because we only have two values to work with. Not much information can be drawn by that alone.

The two graphs we plotted were not used to draw any form of statistical conclusion. Rather, we used both the summaries and the graphs of the summaries as a means of getting a general overview of what to look for.

  3. Plot all the data for each variable (Note: geom_jitter is used to add a small amount of random noise to the points so that overlapping observations can be told apart)
ggplot(vocab, aes(x=education, y=vocabulary))+geom_jitter()

With this graph, we observe a roughly linear upward trend beyond 10 years of education. This implies that for individuals with 10 or more years of education, vocabulary score tends to increase as education increases.

ggplot(vocab,aes(x=year,y=vocabulary))+geom_jitter() 

Like the graph above, the jitter function has added quite a bit of noise to our graph. Let us look at the density of the points to draw some conclusions. Notice that there is a high density of perfect scores (10 out of 10 on the vocabulary test) in the years before 1980. In contrast, there is a lower density of perfect scores around the 1990s. These differences in perfect scores could be distorting the mean scores per year. We established in our summary table for this variable that the median score is the same throughout the years, and we see that in the graph as well, since the density of points around 6 is very high in every year.

ggplot(vocab,aes(x=sex,y=vocabulary))+geom_boxplot()+geom_jitter()

Let us start by looking at the density for the female boxplot. We see that the highest density of points is around the middle, in other words, around the median. We do, however, notice some density in the lower tail of the boxplot and high density in the upper tail. This will distort the mean and bring it lower than it would otherwise be.

Now let us look at the density for the male boxplot. We can make a similar observation: the highest density is around the median. We also note that there is some density around the lower tail and high density around the upper tail.

First Stage Analysis

  1. Run a multiple regression using all three variables

vocab.full=lm(vocabulary~year+sex+education,data=vocab)
summary(vocab.full)
## 
## Call:
## lm(formula = vocabulary ~ year + sex + education, data = vocab)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.362 -1.144  0.073  1.256  8.848 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.944550   2.963670  12.128   <2e-16 ***
## year        -0.017375   0.001495 -11.622   <2e-16 ***
## sexMale     -0.215443   0.025672  -8.392   <2e-16 ***
## education    0.367457   0.004261  86.232   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.867 on 21634 degrees of freedom
## Multiple R-squared:  0.2568, Adjusted R-squared:  0.2567 
## F-statistic:  2491 on 3 and 21634 DF,  p-value: < 2.2e-16

We notice a few things after running a multiple regression on the data as a whole. First, the regression returns an R-squared value of about 26%, which is fairly low. However, the p-value for the overall F-test is extremely low. This suggests that the factors year, sex and education are associated with vocabulary scores, but that they are not necessarily good predictors of them.
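To make the coefficients concrete, the sketch below plugs them into the fitted equation by hand. The coefficient values are copied from the `summary(vocab.full)` output above; the helper `predict_score()` and the example respondent (surveyed in 2000, 12 years of education) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(vocab.full) output above.
b <- c(intercept = 35.944550, year = -0.017375,
       sexMale = -0.215443, education = 0.367457)

# Hypothetical helper (not part of the original analysis):
# fitted score = intercept + b_year*year + b_sexMale*male + b_education*education
predict_score <- function(year, male, education) {
  unname(b["intercept"] + b["year"] * year +
           b["sexMale"] * male + b["education"] * education)
}

# Example respondent: surveyed in 2000 with 12 years of education.
female_2000 <- predict_score(2000, male = 0, education = 12)
male_2000   <- predict_score(2000, male = 1, education = 12)
print(c(female = female_2000, male = male_2000))
```

Holding year and education fixed, the two predictions differ by exactly the `sexMale` coefficient, and each additional year of education adds the `education` coefficient to the predicted score.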

  2. Run a linear regression using only the variable: education
vocab.edu=lm(vocabulary~education,data=vocab)
summary(vocab.edu)
## 
## Call:
## lm(formula = vocabulary ~ education, data = vocab)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.560 -1.137  0.134  1.287  8.558 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.442395   0.055144   26.16   <2e-16 ***
## education   0.355900   0.004193   84.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.875 on 21636 degrees of freedom
## Multiple R-squared:  0.2498, Adjusted R-squared:  0.2498 
## F-statistic:  7205 on 1 and 21636 DF,  p-value: < 2.2e-16
ggplot(vocab.edu,aes(x=.fitted,y=.resid))+geom_jitter()

Our regression results show a low R-squared value of only 25%. Once again, this suggests that education alone may not be a good predictor of score. There is nevertheless an association between education and score, since the test returned a very low p-value, meaning the relationship is statistically significant after all.

This lines up with our graph of the summarized means for education, where we noted an upward trend in scores as education increased. The regression supports the existence of an association between the two.

When checking how trustworthy our linear regression is, we must check whether the residual plot shows a random scatter. In this case we notice a clear downward pattern, yet we can still go forward with this model. Why? We are working with a discrete set of scores that runs only from 0 to 10, so no observation can fall below zero or above 10, and all respondents with the same score fall on the same downward-sloping line in the residual plot. Therefore, despite the non-random appearance of the residual plot, we consider it safe to move forward with this model.
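The effect of a discrete, bounded response on a residual plot can be reproduced on simulated data. In the sketch below every number is invented for illustration (the seed, sample size, and the intercept, slope and noise level of the simulated scores); it does not use the survey data and says nothing about whether the linear model is otherwise adequate.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
n <- 1000

# Simulated years of education and a simulated latent score.
education <- sample(0:20, n, replace = TRUE)
latent <- 1.4 + 0.36 * education + rnorm(n, sd = 1.9)

# Round to whole points and clip to the 0-10 scale, like a 10-item test.
score <- pmin(10, pmax(0, round(latent)))

fit <- lm(score ~ education)
fv  <- unname(fitted(fit))
res <- unname(resid(fit))

# For any fixed observed score k, residual = k - fitted value, so all points
# with that score lie on one straight line of slope -1 in the residual plot.
band_check <- all.equal(res, score - fv)
print(band_check)

# Number of distinct bands that would appear in the plot.
print(length(unique(score)))
```

Plotting `res` against `fv` for this simulation shows the same kind of parallel downward bands, one per distinct score, which is a consequence of the scale rather than of the predictor.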

  3. Run a linear regression using only the variable: year
vocab.yr=lm(vocabulary~year,data=vocab)
  summary(vocab.yr) 
## 
## Call:
## lm(formula = vocabulary ~ year, data = vocab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1116 -1.0659  0.0027  1.1132  4.1170 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.156212   3.382082  -2.707  0.00679 ** 
## year         0.007619   0.001701   4.480 7.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.164 on 21636 degrees of freedom
## Multiple R-squared:  0.0009269,  Adjusted R-squared:  0.0008807 
## F-statistic: 20.07 on 1 and 21636 DF,  p-value: 7.494e-06
ggplot(vocab.yr, aes(x=.fitted, y=.resid))+geom_jitter()

Again we are faced with a low R-squared value, 0.09269%, along with a low p-value. As before, we conclude that the year in which the individual was surveyed is not a good predictor of scores, although there is some form of association between the two.

  4. Run a linear regression using the variable: sex
vocab.sex=lm(vocabulary~sex,data=vocab)
  summary(vocab.sex)
## 
## Call:
## lm(formula = vocabulary ~ sex, data = vocab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0327 -1.0327 -0.0327  1.0521  4.0521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.03273    0.01951 309.208  < 2e-16 ***
## sexMale     -0.08484    0.02972  -2.855  0.00431 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.165 on 21636 degrees of freedom
## Multiple R-squared:  0.0003766,  Adjusted R-squared:  0.0003304 
## F-statistic: 8.151 on 1 and 21636 DF,  p-value: 0.004308
ggplot(vocab.sex, aes(x=.fitted, y=.resid))+geom_point()

Once again, we have a low R-Squared value paired with a low p-value, meaning there is association between the two variables.

Since our regressions tell us that there is some form of association between vocabulary scores and each individual factor, we must investigate further to get a better understanding of the effects.

Second Stage Analysis

  1. Confirm that the variables we are working with are normal (symmetric)
boxplot(vocab$vocabulary) #symmetric

boxplot(vocab$year) #symmetric

boxplot(vocab$education) #skewed

After drawing boxplots of all three variables, we see that vocabulary and year are both symmetric, which is consistent with an approximately normal distribution. Education, unlike the other two, seems to be right skewed. Normally we would consider log-transforming the education variable in an attempt to normalize the data, but in this case, since vocabulary scores take a discrete set of values from 0 to 10, we will not be faced with outliers in the scores. Therefore, the skewed education variable may still be used without alteration.

  2. Further investigate the effect of sex and education by running a multiple regression
vocab.sexedu=lm(vocabulary~sex+education,data=vocab)
  summary(vocab.sexedu)
## 
## Call:
## lm(formula = vocabulary ~ sex + education, data = vocab)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.455 -1.160  0.124  1.198  8.703 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.508057   0.055641  27.103  < 2e-16 ***
## sexMale     -0.210690   0.025748  -8.183 2.93e-16 ***
## education    0.357865   0.004193  85.338  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.873 on 21635 degrees of freedom
## Multiple R-squared:  0.2521, Adjusted R-squared:  0.2521 
## F-statistic:  3647 on 2 and 21635 DF,  p-value: < 2.2e-16

From the multiple regression, we see that the R-squared value is roughly 25%. This means that the two variables together, sex and education, explain about a quarter of the variation in the dependent variable, vocabulary score. The p-value for the model is very low, indicating that the model is statistically significant. We also notice that the p-value for sex decreases from 0.00431 in the linear regression on sex alone to 2.93e-16 in this multiple regression on sex and education, and that the coefficient for sexMale grows in magnitude from -0.08484 to -0.210690. This suggests that the effect of gender is more pronounced once education is taken into account.
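The change in the `sexMale` coefficient between the two models can be unpacked with a short calculation. For least-squares fits on the same data, the coefficient from the model without education equals the adjusted coefficient plus the education coefficient times the male-minus-female gap in mean years of education. The sketch below solves that identity for the gap, using only the rounded coefficients printed in the outputs above; the variable names are illustrative.

```r
# Coefficients copied from the regression outputs above.
b_sex_alone <- -0.08484    # sexMale in lm(vocabulary ~ sex)
b_sex_adj   <- -0.210690   # sexMale in lm(vocabulary ~ sex + education)
b_edu_adj   <-  0.357865   # education in lm(vocabulary ~ sex + education)

# Least-squares identity for nested models fitted to the same data:
#   b_sex_alone = b_sex_adj + b_edu_adj * edu_gap
# where edu_gap is mean(education | Male) - mean(education | Female).
edu_gap <- (b_sex_alone - b_sex_adj) / b_edu_adj
print(edu_gap)

# Part of the raw male-female difference that is attributable to the gap.
offset_from_education <- b_edu_adj * edu_gap
print(offset_from_education)
```

Under this reading, the coefficients imply that males in the sample have roughly a third of a year more education on average, which offsets part of the adjusted difference and leaves the smaller raw difference seen in the regression on sex alone. The gap could be checked directly on the data with a grouped mean of `education` by `sex`.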

Our next step is to test for correlation between sex and education to check if our data is distorted based on the fact that a certain gender has more education thus leading to a higher score.

  3. Investigate any existing correlation between the variables: sex and education
gender=as.numeric(factor(vocab$sex))

edu=vocab$education

cor(gender,edu)
## [1] 0.05727286

We find here that there is a weak positive correlation between the variables sex and education. The correlation coefficient is about 0.06; because factor() codes Female as 1 and Male as 2, the positive sign tells us that, on average, male students have slightly more years of education. Although the correlation is small, it helps us interpret how these two variables jointly affect vocabulary scores.

  4. Test for normality of vocabulary within each sex before running ANOVA
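Two details of this step are easy to check in isolation: how `as.numeric(factor(...))` codes the two sexes, and how much shared variance a correlation of this size represents. The sketch below uses a four-element toy vector, not the survey data, and the correlation value is copied from the output above.

```r
# Toy vector (not the real data) to show how factor() codes the levels.
sex_example <- c("Male", "Female", "Female", "Male")

# Levels are sorted alphabetically, so Female = 1 and Male = 2.
codes <- as.numeric(factor(sex_example))
print(levels(factor(sex_example)))
print(codes)

# Correlation copied from the output above.
r <- 0.05727286

# Squared correlation, as a percentage of shared variance.
shared_variance_pct <- 100 * r^2
print(shared_variance_pct)
```

Because Male receives the larger code, a positive coefficient corresponds to more education among males. The squared correlation is well under one percent, so sex and education are only loosely related in this sample.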
ggplot(vocab, aes(sample=vocabulary))+stat_qq()+facet_wrap(~sex) 

Before running the ANOVA test, we must check for normality within our data. For this reason we drew a Q-Q plot of our dependent variable, vocabulary, for each gender. From the plot we can see no outliers, although the tails are somewhat short. In this case, we are safe to run the ANOVA test.

  5. Run ANOVA
vocab.aov=aov(vocabulary~sex,data=vocab)
  summary(vocab.aov)
##                Df Sum Sq Mean Sq F value  Pr(>F)   
## sex             1     38   38.20   8.151 0.00431 **
## Residuals   21636 101398    4.69                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our ANOVA returns a very small p-value. This suggests that there is a statistically significant difference in mean vocabulary scores between the two genders. Nonetheless, we can inspect this further by running a two-sample t-test.

  6. Run a two-sample t-test to compare with the ANOVA results
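With a single two-level factor, the one-way ANOVA and the earlier regression on sex are the same test: the F statistic is the square of the t statistic for `sexMale`, and the p-values coincide. A quick check with the rounded values printed in the two outputs:

```r
# Values copied from the outputs above.
t_sex <- -2.855   # t value for sexMale in lm(vocabulary ~ sex)
f_aov <-  8.151   # F value for sex in aov(vocabulary ~ sex)

# For one two-level factor, F equals t squared (up to rounding).
t_squared <- t_sex^2
print(c(t_squared = t_squared, F = f_aov))
```

This is why the ANOVA p-value of 0.00431 matches the p-value for `sexMale` in the regression on sex.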
t.test(vocabulary~sex,data=vocab)
## 
##  Welch Two Sample t-test
## 
## data:  vocabulary by sex
## t = 2.846, df = 19835, p-value = 0.004432
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02640996 0.14327937
## sample estimates:
## mean in group Female   mean in group Male 
##             6.032732             5.947888

After conducting the t-test, we are once again presented with a low p-value of 0.0044. This confirms the result of the previous ANOVA test, namely a statistically significant difference in means. However, we also notice that the 95% confidence interval runs roughly from 0.03 to 0.14, which is a rather small difference on the 10-point scale used for the vocabulary test. With this information alone, one could dismiss the small p-value and conclude that the difference in means is of little practical importance. However, from our previous correlation test between sex and education, we concluded that men have slightly more education on average. This piece of information gives us a different perspective on our results. Since we already concluded that education positively affects vocabulary score, males should theoretically have a higher average vocabulary score. Our test instead shows that females have a higher score despite having a lower average education level. From this we draw the conclusion that the difference in mean score is in fact meaningful.
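The confidence interval and the size of the difference can be reconstructed from the printed summary values. The sketch below recovers the standard error from the t statistic, rebuilds the 95% interval, and expresses the difference as a standardized effect size (Cohen's d). Using the residual standard error from the regression on sex as the pooled standard deviation is an assumption made here for illustration; it is not a step in the original analysis.

```r
# Values copied from the t-test and regression outputs above.
mean_f <- 6.032732
mean_m <- 5.947888
t_stat <- 2.846
df     <- 19835
resid_sd <- 2.165   # residual SE from lm(vocabulary ~ sex), used as pooled SD (assumption)

# Difference in means and its standard error (recovered from t = diff / se).
diff_means <- mean_f - mean_m
se <- diff_means / t_stat

# 95% confidence interval for the difference in means.
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se
print(ci)

# Standardized effect size: difference in units of the pooled SD.
cohens_d <- diff_means / resid_sd
print(cohens_d)
```

The rebuilt interval agrees with the printed one up to rounding of the t statistic. The standardized difference is only a few hundredths of a standard deviation, which is the quantitative basis for calling the raw difference small before education is taken into account.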

Conclusion

After conducting numerous tests and plotting several graphs, we found that all of our hypotheses were supported. Both education and gender had an effect on vocabulary scores, though education seemed to have the more significant impact. The effect of year on vocabulary scores over the span of 30 years was very small, and few trends were noticeable.

We started off expecting vocabulary scores to depend on the number of years students chose to invest in education. Our hypothesis was first supported when we saw a clear upward trend in the mean vocabulary scores as the number of years of education increased. On further investigation, our linear regression on education suggested that education alone may not be a strong predictor of how well a student does on the test. However, the low p-value we observed shows that education is associated with vocabulary scores, and the positive coefficient shows that vocabulary score increases with education. This makes sense: with more education and practice, students are likely to make progress on academic evaluations.

Unlike education, the year the test was taken did not have a notable impact on vocabulary scores. After running a regression of scores on year, we found that, as with education, vocabulary score cannot be predicted well by the year alone. The regression did reveal some association between the two. When analyzing the extent of this association, however, we saw no clear trend in either the scatterplot of mean scores by year or the jitter plot of all vocabulary scores by year. The coefficient for year in the regression is also much smaller than the coefficient for education, meaning that test scores were not notably affected as the years progressed. This outcome is consistent with the article by Barshay (2013), which stated that students' grades made only marginal progress within the last 40 years.

When analyzing the impact of gender, we discovered that females scored higher on the vocabulary tests than males, confirming our hypothesis. The first indication that led to this discovery was the jitter plot for the variables sex and vocabulary score. Though the results for the female and male categories looked similar, the female results were noticeably more dense near the higher scores than the male results. The linear regression gave results similar to those for the previous variables: sex is not necessarily a great predictor of vocabulary, but scores still depend on it to a certain extent. The negative coefficient for the variable sexMale indicated that male students scored somewhat lower, on average, than female students. Our ANOVA table showed that the difference in mean scores between the two genders was statistically significant, and we strengthened this claim by looking at the correlation between education and gender. Our findings showed that men had more years of education on average than women, yet women still ended up with the higher average scores.

Discussion

The results of this study supported all of our hypotheses. We concluded that vocabulary test scores depend mainly on the length of education and the gender of the students. We can use this information to possibly improve the educational system and to encourage other enhancements in the educational and career fields. Since girls in our data performed better in vocabulary, even with less education on average, the findings could help identify job opportunities for females that complement their strengths in vocabulary. Similarly, we can work towards advancements in teaching methods to ensure that students' grades improve, as our study revealed the minimal effect the passing of years had on test scores.

The results of our investigation support some of the previous studies discussed earlier. Our finding that girls score higher than boys on the vocabulary test adds to the articles we cited. According to Turcotte (2011), women show higher levels of academic understanding from as early as elementary school. Girls often do better in school because of certain characteristics that are more common in girls than in boys. According to Xia (2013), girls are more sensitive to grammar and to the concept of language in general. This would explain why the female students in our data set also did better than the male students on their vocabulary tests. On another note, we see that there has been very little improvement in school grades over the last 30 years, which corresponds to the findings reported by Barshay (2013).

Although the results of our study show that female students and students with higher levels of education tend to do better on vocabulary tests, it is difficult to generalize this finding because the study has limitations. In our data set we worked with only three independent variables, and we tested how each affects scores. In the real world, however, many other factors may also play a large role in test scores. From the findings of Xia (2012), we see that students' academic grades can vary depending on numerous factors. These include whether English was the students' first language, their socioeconomic status, their ethnicity, the educational backgrounds of their families, their involvement in extracurricular activities, and so forth. The main concern is that vocabulary test scores do not depend exclusively on the educational background and gender of students, and for this reason further research on this topic is necessary.

References

Barshay, Jill. 2013. “High School Test Scores Haven’t Improved for 40 Years; Top Students Stagnating.” Theory and Practice in Language Studies. Academy Publisher.

Turcotte, Martin. 2011. “Women and Education.” Statistics Canada. Minister of Industry, 89–103.

Xia, Xiufang. 2013. “Gender Differences in Using Language.” Theory and Practice in Language Studies 3 (8). Academy Publisher: 1487–9. doi:10.4304/tpls.3.8.1485-1489.

Xia, Xiufang. 2012. “Summary of Major Findings.” Theory and Practice in Language Studies. Academy Publisher. doi:10.4304/tpls.3.8.1485-1489.