This project examines lexical diversity in oral interviews with three groups of heritage Spanish speakers: highly proficient speakers, speakers with more limited fluency, and speakers who rarely use Spanish.
These data were sent to me by Maria Ciriza as Word files, which I converted to plain-text files, first excising elements obviously extraneous to the analysis (e.g. headings, numbers), then compressing them and placing them in a GitHub repository.
In addition to the data files, this script can be found in the same repository, so that the analysis is fully reproducible.
To begin, download the text files from the repository and set your working directory to the folder where the files were downloaded. We first print out the working directory used here and its contents, to show what your working directory should look like.
# Set working directory and print out its contents
getwd()
## [1] "/Users/christopherstewart/Desktop/sp_heritage"
list.files()
## [1] "group1.txt" "group2.txt" "group3.txt"
## [4] "Ngrams_tokenizer.R" "RStudioScript.html" "RStudioScript.Rmd"
## [7] "sp_heritage_txts.zip"
First, we clean our corpus to prepare it for tokenization and n-gram extraction. This code chunk is shown for illustrative purposes only, so that you can see what the code looks like; subsequent code chunks are hidden from view in the HTML document produced.
# Load required packages
suppressPackageStartupMessages(require("stringr"))
# load in data from 3 groups
group1_corp <- readLines("group1.txt")
group2_corp <- readLines("group2.txt")
group3_corp <- readLines("group3.txt")
# convert all text to lower case
group1_corp.1 <- tolower(group1_corp)
group2_corp.1 <- tolower(group2_corp)
group3_corp.1 <- tolower(group3_corp)
# replace non-alphanumeric characters (punctuation and the like) with a space
group1_corp.2 <- str_replace_all(group1_corp.1, "[^[:alnum:]]", " ")
group2_corp.2 <- str_replace_all(group2_corp.1, "[^[:alnum:]]", " ")
group3_corp.2 <- str_replace_all(group3_corp.1, "[^[:alnum:]]", " ")
# eliminate digits, then collapse the runs of spaces left behind by corpus
# cleanup and trim leading and trailing whitespace
group1_corp.3 <- str_replace_all(group1_corp.2, "[[:digit:]]+", " ")
group1_corp.c <- str_trim(str_replace_all(group1_corp.3, " +", " "))
group2_corp.3 <- str_replace_all(group2_corp.2, "[[:digit:]]+", " ")
group2_corp.c <- str_trim(str_replace_all(group2_corp.3, " +", " "))
group3_corp.3 <- str_replace_all(group3_corp.2, "[[:digit:]]+", " ")
group3_corp.c <- str_trim(str_replace_all(group3_corp.3, " +", " "))
# Clean up intermediate objects
rm(group1_corp, group2_corp, group3_corp,
   group1_corp.1, group2_corp.1, group3_corp.1,
   group1_corp.2, group2_corp.2, group3_corp.2,
   group1_corp.3, group2_corp.3, group3_corp.3)
We now tokenize and produce 2-, 3- and 4-grams using Maciej Szymkiewicz’s efficient Ngrams_tokenizer function.
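Because the remaining chunks are hidden in the rendered document, here is a minimal sketch of this step. It assumes that Ngrams_tokenizer.R defines a factory ngram_tokenizer(n) that returns a tokenizing function for a string; the object names (group1_text, group1_tokens, and so on) are illustrative, not necessarily those used in the hidden code.
# Sketch of the tokenization step; assumes Ngrams_tokenizer.R defines a
# factory ngram_tokenizer(n) returning a function that splits a string
# into n-grams
source("Ngrams_tokenizer.R")
tokenizer    <- ngram_tokenizer(1)  # unigrams, i.e. individual tokens
bigramizer   <- ngram_tokenizer(2)
trigramizer  <- ngram_tokenizer(3)
quadgramizer <- ngram_tokenizer(4)
# collapse each cleaned corpus into a single string, then tokenize
group1_text      <- paste(group1_corp.c, collapse = " ")
group1_tokens    <- tokenizer(group1_text)
group1_bigrams   <- bigramizer(group1_text)
group1_trigrams  <- trigramizer(group1_text)
group1_quadgrams <- quadgramizer(group1_text)
# groups 2 and 3 proceed identically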
Next, we use R’s data.table package to store and organize our n-grams.
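As a sketch (reusing the illustrative object names from above), each n-gram vector becomes a one-column data table:
suppressPackageStartupMessages(require("data.table"))
# one table per group and n-gram order; shown here for group 1 only
group1_tokens_dt  <- data.table(ngram = group1_tokens)
group1_bigrams_dt <- data.table(ngram = group1_bigrams)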
We then use these data tables to derive frequency counts for our n-grams.
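With data.table, counting and sorting is a one-liner; a sketch using the illustrative tables above:
# count occurrences of each n-gram, then sort by decreasing frequency
group1_tokens_freq  <- group1_tokens_dt[, .N, by = ngram][order(-N)]
group1_bigrams_freq <- group1_bigrams_dt[, .N, by = ngram][order(-N)]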
For this initial exploration, we build barplots of the 20 most frequent n-grams for the three groups.
We now build barplots of the 20 most frequent tokens across the three groups.
And then barplots of the 20 most frequent bigrams across the three groups; a sketch of the plotting code appears below.
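The plotting code itself is hidden in the rendered document; the following sketch with base R graphics, using the illustrative frequency tables from above, shows the general shape of each plot.
# barplot of the 20 most frequent tokens for group 1; pointing the same
# call at the other frequency tables produces the remaining plots
top20 <- group1_tokens_freq[1:20]
barplot(top20$N, names.arg = top20$ngram, las = 2, cex.names = 0.7,
        ylab = "Frequency", main = "Group 1: 20 most frequent tokens")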