Introduction

Essex County, New Jersey, is home to Montclair State University and more than 800,000 people from a wide variety of backgrounds. This project looks at the languages spoken at home in Essex County households, using data from the 2015 American Community Survey. By identifying where English as a Second Language (ESL) learners are most in need, local libraries and schools can better direct their efforts to provide learning resources to these communities, tailored to the students’ native languages.

Languages Spoken at Home

Identifying ESL Need via Census Data

Certain municipalities may benefit from increased availability of English as a Second Language educational resources at local schools and libraries. To identify which areas may be in need, we will look for places where the language spoken at home is not English and residents are reported as speaking English less than “very well” on the American Community Survey. By investigating the primary language spoken at home, we can better provide ESL materials in the learners’ native languages, improving accessibility. This information should be shared with each municipality’s libraries and schools so educators can act on community needs.
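As a rough illustration of the screening step described above, the following base R sketch computes the share of residents who speak English less than “very well” for each area and flags areas above a cutoff. The data frame, its column names, the counts, and the 20% threshold are all hypothetical placeholders, not values from the American Community Survey or from this project's code.

```r
# Hypothetical area-level counts; names and numbers are illustrative assumptions.
acs <- data.frame(
  area                = c("Tract A", "Tract B", "Tract C"),
  population_5up      = c(4000, 2500, 5200),  # residents aged 5 and over
  less_than_very_well = c(900, 100, 2080)     # speak English less than "very well"
)

# Share of residents aged 5+ who speak English less than "very well".
acs$limited_english_share <- acs$less_than_very_well / acs$population_5up

# Flag areas above an arbitrary, illustrative threshold of 20%.
threshold <- 0.20
acs$flagged <- acs$limited_english_share > threshold

# List areas from the highest to the lowest share.
acs[order(-acs$limited_english_share),
    c("area", "limited_english_share", "flagged")]
```

Using a share rather than a raw count keeps sparsely populated areas comparable with dense ones, although very small denominators can still produce misleading values.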

There may be an area in need between Livingston and South Orange. However, this may be due to the South Mountain Reservation park, whose location could skew the population data. How this tract/PUMA was delineated could not be determined, so the apparent hotspot is likely a red herring. An examination of the county on a by-language basis follows.

Population by Language Spoken at Home by English Ability

Italian

Spanish

French

Portuguese

Chinese

Vietnamese

Arabic

Referring to Google Maps for approximate municipality names

We can now observe that specific areas correspond to the hotspot locations previously highlighted in the overall “limited English” Essex County map shown at the start. The following regions could benefit from increased ESL resources for…

Questions and Concerns About the Census Data

How can we be sure these populations actually “speak English less than very well”? The American Community Survey rating is a self-assessment given by the respondent, or by the household member who completed the form, rather than the result of a standardized test, so it may be subject to bias. For example, did speakers rate themselves poorly just for having a non-standard accent, even if they could be easily understood? This subjective judgement should be taken with a grain of salt.

Next Steps

Survey data can be used to effectively ascertain locales in need of English as a Second Language educational resources. How can we customize resources to best assist ESL learners based on their personal native languages? To gain insight into this, we will analyze a learner corpus.

Investigating a Learner Corpus

The University of Pittsburgh English Language Institute Corpus (PELIC) is a dataset composed of texts written by learners of English. It was collected in fall, spring, and summer semesters from 2006 through 2012. The dataset includes useful information such as the learner’s native language (L1) and English proficiency according to course level.

Development of ESL Writing Ability

We can observe the improvement of students’ writing ability in the mean length of their essays and written responses, which increases with course level. (Texts shorter than 10 tokens have been excluded, because the data contained multiple-choice and fill-in-the-blank responses among the essays.) The course levels are as follows:

  • 2 Pre-Intermediate
  • 3 Intermediate
  • 4 Upper-Intermediate
  • 5 Advanced
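The mean text length by native language and course level could be computed along the following lines. This is a minimal base R sketch on made-up data: the data frame `texts` and its columns `L1`, `level_id`, and `text_len` are assumed names for illustration and may not match the columns in the corpus files.

```r
# Hypothetical stand-in for the corpus; column names and values are assumptions.
texts <- data.frame(
  L1       = c("Arabic", "Arabic", "Arabic", "Arabic", "Chinese", "Chinese", "Chinese"),
  level_id = c(3, 3, 3, 5, 3, 5, 5),
  text_len = c(4, 120, 80, 310, 95, 8, 280)  # text length in tokens
)

# Drop very short responses (fewer than 10 tokens), such as
# multiple-choice or fill-in-the-blank answers.
essays <- texts[texts$text_len >= 10, ]

# Mean text length for each combination of native language and course level.
mean_len <- aggregate(text_len ~ L1 + level_id, data = essays, FUN = mean)
mean_len
```

The resulting table has one row per language and level, which is the shape needed to plot one panel per native language.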

The visualizations below are organized according to the students’ native language. Students in the Upper-Intermediate and Advanced courses showed a clear increase in the length of text they were able to produce. Higher-level coursework is associated with a marked gain in this measure of writing proficiency.

Improvements You Can See!

Italian

Spanish

French

Portuguese

Chinese

Vietnamese

Arabic

What Did Students Write About?

Now we will use word clouds to show term frequency in the students’ writing. By examining production tasks, we can attempt to identify areas where learners may need specific support based on their L1s.
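A word cloud is a visual rendering of a term-frequency table. The sketch below shows one way such a table could be built in base R. The two sentences and the short stop-word list are invented for illustration, and this is not necessarily how the clouds in this report were generated.

```r
# Two made-up learner sentences; real input would be the essay texts.
docs <- c("My family is big and my family is happy.",
          "I like my city because the city is big.")

# Lowercase, then split on anything that is not a letter or apostrophe.
tokens <- unlist(strsplit(tolower(docs), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
stop_words <- c("my", "is", "and", "i", "the", "because")
tokens <- tokens[!tokens %in% stop_words]

# Term frequencies, most frequent first.
term_freq <- sort(table(tokens), decreasing = TRUE)
term_freq
```

Computing the table separately for each L1 makes it possible to compare which terms dominate in each group before any plotting is done.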

Italian

Spanish

French

Portuguese

Chinese

Vietnamese

Arabic

Further Research Directions

Originally I intended to make part-of-speech (POS) clouds to investigate whether certain parts of speech were under- or overused in learner production, with the aim of informing personalized ESL educational materials according to the students’ L1. This was inspired by a cursory reading of the texts themselves, where I observed that students whose native language was Chinese appeared to omit determiners more often than students whose native language was Spanish. I wanted to see whether this would show up in a frequency analysis of the POS tags.

However, tagging seems to be a compute-intensive task in R. I attempted to run a POS annotation package, but the process crashed even on powerful lab computers. There is a token_lemma_POS column in the dataset, but I was unable to extract the POS tags from it either. See the code comments for more information. I would like to have examined this further. Instead, we saw some traditional word clouds.
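One possible way to recover the tags without re-running a tagger is a regular expression over the existing column. The sketch below assumes, without having verified it against the dataset, that each entry is stored as a single string of (token, lemma, POS) triples with single-quoted fields. The example string is invented, and the pattern would need adjusting if the real format differs.

```r
# Invented example of one entry; the storage format is an assumption.
tagged <- paste0(
  "(('I', 'I', 'PRP'), ('have', 'have', 'VBP'), ('a', 'a', 'DT'), ",
  "('dog', 'dog', 'NN'), ('and', 'and', 'CC'), ('a', 'a', 'DT'), ",
  "('cat', 'cat', 'NN'))"
)

# Match the last quoted field of each triple: a quoted string followed by ")".
hits <- regmatches(tagged, gregexpr("'[^']*'\\)", tagged))[[1]]

# Strip the quotes and the closing parenthesis, leaving the bare tag.
pos_tags <- gsub("[')]", "", hits)

# Frequency of each POS tag, most frequent first.
pos_freq <- sort(table(pos_tags), decreasing = TRUE)
pos_freq
```

Comparing the relative frequency of determiner tags (DT in this invented example) across L1 groups would be one way to test the observation about determiner use.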

Conclusion

Local census data can tell us a lot about where ESL learners may be in need. By analyzing county statistics on language spoken at home, we were able to identify locales that may benefit from increased ESL resources in schools and libraries. Learner corpora such as PELIC can give insight into what these resources might look like when personalized for the students’ native languages.



References

Data provided by:

Packages:

# easily find package citations using the citation() function!
#citation("tidyverse")