I. Summary

Three corpora, tweet (160 MB), blog (201 MB), and news (197 MB), were explored in preparation for creating a text prediction model. The corpora were explored in three stages: prior to cleaning (full corpus and 10% sub-sample), after reshaping lines to sentences (10% sub-sample), and after cleaning (10% sub-sample). The 10% sub-samples were representative of the corpora, as suggested by general properties and density plots. Parts of speech analysis classified a large proportion of the corpora as nouns and verbs. A function was written to clean the corpora and create (1-4)-grams. Analysis of these (1-4)-grams showed that uniqueness increased with n-gram length and that closed-class words are important in building longer phrases. The parts of speech and n-gram analyses suggest that some word types (e.g. nouns, verbs) may be more difficult to predict because of their large variety. Using longer n-grams may not necessarily be an efficient approach, as these are more likely to contain rare words, such as proper nouns, with very specific relevance to time and place.

II. Background, Preparation, and Exploration

A. Background

The overall objective is to use language models built from large text corpora to create a Shiny app that predicts text. This interim report provides an overview of the data that will be used to create the models.

There are three corpora provided through the course as text files: tweet (160 MB), blog (201 MB), and news (197 MB). These were collected in 2014. Each consists of many lines containing letters, numbers, and symbols. The three sets represent three different writing styles and vocabularies. A brief exploration of the full corpora was conducted, followed by more detailed analysis of random sub-samples representing 10% of the lines from each corpus.

B. Process summary

Figure 1. Diagram summarising the process and figures in this report (.png created in Google Slides).

C. Reading files

The tweet and blog files were read using the readLines function, with settings to avoid prematurely terminating lines containing nulls. The news file terminated prematurely at line 77259; the solution was to use fpeek to read the file in two parts flanking the problematic line. A potential consequence is that some lines containing characters interpreted as nulls may still be truncated.

D. General properties

1. Full corpus and 10% sub-sample counts

Figure 2a shows total counts for each corpus. The corpora contained ~899K - 2.36M lines and ~162M - 206M characters. Although tweet had the most lines, it had the fewest characters, indicating shorter lines compared to blog and news. Figure 2 (b - d) shows density plots with table insets. The column labelled nr > 1.5*IQR gives the number of lines considered outliers, defined as lines with values greater than:
third quartile + 1.5 x (third quartile - first quartile)
Using this formula, blog and news contained roughly 24K to more than 35K lines that were unusually long relative to their respective character- and word-per-line distributions.
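
As a minimal sketch of this outlier rule, applied to the per-line character counts computed in the Appendix (twall3$lng):

#outlier cut-off: third quartile + 1.5 x IQR of characters per line
q <- quantile(twall3$lng, probs = c(0.25, 0.75), na.rm = TRUE)
cutoff <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
#number of lines longer than the cut-off (nr > 1.5*IQR)
n_outliers <- sum(twall3$lng > cutoff)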

The 10% sub-samples closely resembled the full corpora in terms of density distribution and minimum, median, and mean characters and words per line. The sub-samples contained about 10% of the lines classed as outliers. The 10% sub-samples were therefore used for more detailed exploration.
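
As a sketch of how such a sub-sample could be drawn (the seed and object names are illustrative; twall is the vector of tweet lines read in the Appendix):

#draw a reproducible 10% random sample of lines from the tweet corpus
set.seed(1234)
idx <- sample(length(twall), size = round(0.1 * length(twall)))
tw10 <- twall[idx]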

Figure 2. Summary of total counts (a); and density plots of characters per line, with per line character (char) and word (word) summary statistics (b - d). Data for the full corpora, and 10% sub-samples (suffix 10 or 10% sub) are shown. Plots for blog and news (c and d) were truncated at 4000 characters for clarity. Columns nr > 1.5*IQR are the number of lines that would be considered outliers.

2. Parts of speech (10% subsample)

Parts of speech (pos) tagging (e.g. noun, verb) provides additional information on corpus structure (Figure 3). Tags were generated using spacyr, after reshaping lines to sentences using quanteda.

The tags include traditional English grammar pos (further classed as open or closed). Tokens (e.g. words, symbols, punctuation) that cannot be placed in those groups are included in classes under ‘other’. Open- and closed-class words differ in that open classes evolve quickly, with new words constantly being added, whereas closed classes stay constant over long timeframes (https://www.ucl.ac.uk/internet-grammar/wordclas/open.htm).

The corpora contained a greater proportion of open than closed-class words (Figure 3a, 45 - 58% vs. 38 - 43%). Nouns and verbs were highly represented in the corpora (Figure 3a), and were also highly varied (Figure 3b). In contrast, the closed article and determiner classes had the lowest variety.

Close inspection of the tags revealed imperfect pos tagging, mainly due to unconventional use of punctuation and symbols and the removal of spaces (e.g. in hashtags and web addresses). These symbols, punctuation marks, etc. will be removed in the next steps to create (1-4)-grams.

a. Parts of speech composition.
b. Within corpora token variety for each part of speech.
Figure 3. Parts of speech treemap showing contributions (a), and relative token variety in parts of speech (b) for 10% sub-samples after reshaping to sentences, and before cleaning or filtering. For (b): total tokens/unique tokens were computed for each pos in each corpus; a higher number indicates less token variety for that pos within a corpus.

Abbreviations (https://universaldependencies.org/u/pos/): adj - adjective, adp - adposition, adv - adverb, aux - auxiliary, cconj - coordinating conjunction, det - determiner, intj - interjection, noun - noun, num - numeral, part - particle, pron - pronoun, propn - proper noun, punct - punctuation, sconj - subordinating conjunction, sym - symbol, verb - verb, x - other.

3. N-grams (10% sub-sample)

A function was created to clean the sentence-reshaped corpora and create data frames of (1-4)-grams, following the examples of Hansen (2017), Schweinberger (2024), and Van Otten (2023), with modifications. Cleaning was done at the line and then the word level to remove hashtags, web and e-mail addresses, symbols, equations, non-standard English characters, digits, and swear words. It retained hyphenated words (hyphens replaced by spaces when there were three or more hyphens) and periods in a selected list of abbreviations. Words were filtered based on a maximum character length, and words used to create n-grams were further filtered based on a minimum occurrence. A second set was created by also filtering out the most common words.
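
The full function is only partially reproduced in the Appendix. As a minimal sketch of the tokenisation and 2-gram counting steps (not the actual function, which uses string-based cleaning; this sketch uses quanteda tokenisers instead, and object names are illustrative):

#tokenise the sentence-reshaped sub-sample and drop punctuation, numbers, symbols, and URLs
toks <- tokens(tw10s, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE, remove_url = TRUE) %>%
  tokens_tolower()
#form 2-grams and count their frequencies
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
bgram_freq <- as.data.frame(table(unlist(as.list(bigrams))), stringsAsFactors = FALSE) %>%
  rename(bgram = Var1, freq = Freq) %>%
  arrange(desc(freq))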

i. (1-4)-grams frequency

i_1. Original

Word frequency summaries followed the examples of Schweinberger (2024) and Hansen (2017). Figure 4 shows that the most frequent (1-3)-grams were found in all three corpora. The contribution of the top 40 to the total, and the number of shared n-grams, declined as n increased. The most frequent 1-grams ("the", "to", "i", "a", "you") are closed-class words such as determiners and pronouns (Figure 3). These were also components of 35 - 40 of the top 40 (2-4)-grams.

Figure 4: The 40 most frequent (1-3)-grams. Colours indicate similarity across the three corpora: 1 = unique to the corpus, 3 = shared across all three, bn = in blog and news, and bt = in blog and tweet. The “Top 40” is the % contribution of the top 40 n-grams to the total for that n-gram length. The “All Share” is the percentage of n-grams also found in the other two corpora.
i_2. After removing list of words (-L)

The following high frequency words (Figure 4) were removed from the corpora to understand their effects on (1-4)-grams.

L1 <- c("to", "the", "a", "an")
L2 <- c("i", "you", "u", "he", "she", "it", "they")
L3 <- c("from", "and", "for", "in", "is", "of", "on", "that", "was", "with")
L4 <- c("he's", "she's", "it's")
L <- c(L1, L2, L3, L4)
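
As a minimal sketch of the removal step (tw_words is an illustrative character vector of cleaned words):

#drop the 24 listed words before forming (1-4)-grams
tw_words_L <- tw_words[!tw_words %in% L]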

Figure 5 was similar to Figure 4, in that a majority of the most frequent (1-3)-grams were found in all three corpora. As expected, removing L decreased the contribution of the top 40 to the total (1-3)-grams, and the number of n-grams common to all three corpora declined. This decline was greater for longer n-grams: 2.6 - 2.8 times lower for 3-grams. There were also more instances after removing L where an n-gram was not present in all three corpora.

Figure 5: The 40 most frequent (1-3)-grams after removing 24 frequent words (-L). Colours indicate similarity across the three corpora: 1 = unique to the corpus, 3 = shared across all three, bn = in blog and news, and bt = in blog and tweet. The “Top 40” is the % contribution of the top 40 n-grams to the total for that n-gram length. The “All Share” is the percentage of n-grams also found in the other two corpora.
i_3. 4-grams

As with the (1-3)-grams (Figures 4 and 5), a number of the most frequent 4-grams were found in all three corpora (Figure 6). In the set with filtered words (-L), the overlap in 4-grams declined noticeably compared to the (1-3)-grams. Comparing before and after removing L, the number of 4-grams common to all three corpora declined further, by 5 - 6 times (Figure 6, set 1 vs. 2).

Longer n-grams can be more time- and place-specific. This seems to be the case in the (-L) column, which included “happy cinco de mayo”, “senate president stephen sweeney”, and “case western reserve university”. News -L (Figure 6 c.2) also had more proper nouns than the other sets. These properties may influence prediction efficiency, since proper nouns tend to have very high variability (Figure 3b). These observations might indicate diminishing returns when using long n-grams for general text prediction models.

Figure 6: The 40 most frequent 4-grams before and after (-L) removing 24 frequent words. Colours indicate similarity across the three corpora: 1 = unique to the corpus, 3 = shared across all three, bn = in blog and news, bt = in blog and tweet, and nt = in news and tweet. The “Top 40” is the % contribution of the top 40 n-grams to the total for that n-gram length. The “All Share” is the percentage of n-grams also found in the other two corpora.

ii. Coverage

In Figure 7, a steep slope and a higher y-axis plateau suggest that a few unique n-grams are responsible for a large proportion of corpus n-grams. As n increases from 1 to 4, the counts of frequently used n-grams decline, signalling an increase in the proportion of rare word combinations. This is also demonstrated by the contribution of n-grams that occurred only once (Freq1_pct), which increases from 1.32 - 1.62% for 1-grams to 82.59 - 87.08% for 4-grams. Repeated n-grams declined further when the list of closed-class words was excluded from the corpora (-L, column 2). In addition, the number of unique 4-grams was lower than the number of unique 3-grams after removing L, whereas it was greater prior to L removal, consistent with the importance of the L words in sentence construction.
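
As a sketch of the coverage calculation behind Figure 7 (reusing the illustrative bgram_freq table from the earlier sketch; the same steps apply to any n):

#cumulative % frequency, most frequent n-grams first
cov <- bgram_freq %>%
  arrange(desc(freq)) %>%
  mutate(pct = 100 * freq / sum(freq), cum_pct = cumsum(pct))
#number of unique n-grams needed to cover 50% and 95% of all n-gram occurrences
n50 <- which(cov$cum_pct >= 50)[1]
n95 <- which(cov$cum_pct >= 95)[1]
#Freq1_pct: summed % frequency of n-grams that occur only once
freq1_pct <- sum(cov$pct[cov$freq == 1])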

Figure 7. Frequency of unique (1-4)-grams relative to total unique n-grams before and after (-L) removing a list of selected words. Horizontal lines indicate 50% and 95% total frequency. Table columns include: Freq1_pct = sum of % frequency of n-grams that only occur once; and Total_unique = total number of unique n-grams. To produce the plots, a random sample of 199 was taken from the top 3% most frequently repeated n-grams, and 599 samples from the remaining 97%, together with the most frequent, and the last of the least frequent n-grams.

E. Next steps, creating model, Shiny app

The next steps would be to i) create models for predicting text and ii) create a toy Shiny app for text prediction. Models will be created by taking into account word combinations with higher probabilities based on n-gram frequencies. For example, the nth word would be predicted based on its observed likelihood of occurring given the (n-1)th and (n-2)th words. The accuracy of the models could be verified using unused portions of the corpora. The Shiny app will consist of a field where users can enter a string of words, limited to a maximum length (likely <30 words, based on Figure 2). Predicted text will then be provided based on the user-entered words and the chosen prediction model.
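
As a minimal sketch of such a frequency-based lookup (the 3-gram table tgram_freq and its columns word1, word2, word3, and freq are illustrative, not the final model):

#return the k most frequent third words following the two entered words
predict_next <- function(w1, w2, tgram_freq, k = 3) {
  hits <- tgram_freq %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(freq))
  head(hits$word3, k)
}
#example call: predict_next("thank", "you", tgram_freq)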

Exploratory analysis indicated potential difficulties in accurately and efficiently predicting highly varied but essential open-class words. Of particular concern are the increase in n-grams with a frequency of 1, and the inclusion of proper nouns, as n-gram length increases. In addition, the high frequency of words such as “a”, “an”, and “the” might affect the inclusion of less frequent but meaningful words in predictions.

F. References

III. Appendix

  • Only sample and partial code are shown, for brevity.

A. System and packages

1. System

R 4.3.3; RStudio build 402; Windows 10; 64 bit, 1.1 GHz Intel Celeron; 4.00 GB RAM; 60 GB HD; 30 GB external drive.

2. Packages

library(dplyr)
library(fpeek)
library(ggpmisc)
library(ggplot2)
library(ggpubr)
library(plotly)
library(quanteda)
library(reticulate)
library(spacyr)
library(stringr)

B. Full corpora

1. Read files

i. File location connections

enTwit <- "Coursera-Swiftkey/final/en_US/en_US.twitter.txt"

ii. Using readLines

con <- file(enTwit, "r")
twall <- readLines(con, n = -1, skipNul = TRUE)
close(con)

iii. Using fpeek

  • To count lines in the document prior to reading the file.
  • Necessary for news because of termination at line 77259.
#number of lines
peek_count_lines(enNews)

#77258 before problematic line
Nws_h2 <- peek_head(enNews, 77259, intern = TRUE)
Nws_t <- peek_tail(enNews, 932983, intern = TRUE)
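
  • The two parts are then assumed to be recombined into one character vector (object name illustrative).
nwsall <- c(Nws_h2, Nws_t)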

2. Basic corpora properties.

i. Per line word and character counts

#words per line
twall3$word <- str_count(twall3$text, "\\w+")
#characters per line
twall3$lng <- nchar(twall3$text)

C. 10% Subsample

1. Reshape lines to sentences

  • tweet sentences = 375896
  • blog sentences = 235887
  • news sentences = 199129
#reshape the 10% tweet sub-sample corpus to sentences
tw10s <- corpus_reshape(tw10c, to = "sentences")

i. Parse sentences using spacyr, tag with pos

i_1. Parse corpora
##tweets
tw10p <- spacy_parse(tw10s, nounphrase = FALSE, lemma = FALSE, entity = FALSE)
i_2. Token variety per part of speech
#subset tokens tagged as determiners (det)
v1 <- tw10p[tw10p$pos == "det", ]
#number of unique tokens for this pos
v2 <- length(unique(tolower(v1$token)))
#total tokens / unique tokens; a higher value indicates less token variety
v3 <- nrow(v1) / v2

ii. Clean and tokenize sentences using function

i_1. Clean sentences and create (1-4)-grams
i_1a. Create function
#part of step 1 to clean corpus
tw10s2 <- a %>%
  #change all to lower-case
  tolower() %>%
  #replace every run of characters that is not a letter, space, apostrophe, hyphen, or period with blank
  str_replace_all(., "[^[:alpha:][:space:]'.-]+", "")

# Not shown: additional cleaning steps to remove swear phrases; separate individual words (tokenise); further removal of swear words, punctuation, and filtering.
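
# Assumed sketch (not part of the original function): the word2 column used in
# the 2-gram paste below could be derived by shifting the word column with dplyr::lead()
c$word2 <- dplyr::lead(c$word1, 1)
c <- c[!is.na(c$word2), ]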

#part of last steps to create 2-gram; paste words separated by a space
c$bgram <- paste(c$word1, c$word2, sep = " ")
i_1b. Run function to create (1-4)-grams
# words of >19 characters are replaced with spaces; words are included in (2-4)-grams only if their 1-gram frequency is >9; exclude the word "noexceptionsbananarama" (essentially, no words excluded)
L <- c("noexceptionsbananarama")
tw3 <- gr3(tw10s, 19, 10, L)

### with word exceptions
# words of >19 characters are replaced with spaces; words are included in (2-4)-grams only if their 1-gram frequency is >9; exclude words in list L
L1 <- c("to", "the", "a", "an")
L2 <- c("i", "you", "u", "he", "she", "it", "they")
L3 <- c("from", "and", "for", "in", "is", "of", "on", "that", "was", "with")
L4 <- c("he's", "she's", "it's")
L <- c(L1, L2, L3, L4)
tw4 <- gr3(tw10s, 19, 10, L)
i_1c. Additional analysis of (1-4)-grams
  • not shown