People spend more and more time on their mobile devices, and many of these activities involve typing text. However, typing on a mobile device is not always easy. This is why tech companies such as SwiftKey create applications that make typing easier. Suggesting further text to the application user is achieved with predictive text modeling. For example, if we have the sentence:
Mary had a little …
Google's search text prediction suggests that the most likely next word is "lamb". The purpose of this project is to create an application that predicts the following word in a similar way.
For this project, we will use three files containing texts from Twitter messages, blogs and internet news in US English, provided by SwiftKey and downloaded from Coursera. They will be processed and explored below.
The data processing workflow will look like this:
Firstly, we will load all three files to investigate their basic properties.
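For illustration, a minimal sketch of this loading step is given below. It assumes base R only, and the file paths (e.g. `en_US.twitter.txt`) are placeholders rather than values taken from this report.

```r
# Sketch: load the three text files and compute the basic counts
# reported in the table further below. File paths are assumptions.
files <- c(twitter = "en_US.twitter.txt",
           blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Lines   = length(lines),
    Words   = sum(lengths(strsplit(lines, "\\s+"))),
    Size_MB = round(file.size(path) / 1024^2, 2),
    Symbols = sum(nchar(lines))
  )
}

basic_stats <- do.call(rbind, lapply(files, summarise_file))
basic_stats
```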
Here are a few sample lines from the Twitter texts:
## [1] "Things going so well! We've written 2 songs, finished another, and arranged a fourth. So excited for all the progress!"
## [2] "yea. Plus they run at Sports Academy in the mornings."
## [3] "Desk put together, room all set up. Oh boy, oh boy"
## [4] "Save the Date: AdFed Miami \"Lemonade\" Documentary Screening with Erik Proulx. Nov 17th 630pm @ Books and Books (Coral Gables)"
## [5] "Way overdue for a #date! #patience is a virtue"
In a similar way, we can extract sample lines for blogs:
## [1] "Now, speaking in tongues may manifest not only in personal prayer but it may also be the media whereby the Lord wishes to express a word of prophecy to the community. In such cases it is necessary that there be someone in the community to interpret so that it can be of benefit to all. One is also free to ask the Holy Spirit to interpret their prayer language when praying in private."
## [2] "Summary of clinic: Four and a half hours. 15 minutes of face time. A surgery I don't want, but which should have happened already, no bracing for standing/walking, no good advice for an issue we have, no urology answers, more appointments."
## [3] "My cake schedule usually looks like this."
## [4] "Firpo's Balloon Cocktail (mildly adapted)"
## [5] "Step 6. Create different mindmaps and then translate to word document, pdfs, powerpoints, websites based on a particular purpose."
…and news:
## [1] "We took a look at Unit 7, a third-floor 1,780-square-foot two-bedroom plus study that features red oak flooring throughout, a custom kitchen with Bosch appliances, a living/dining space with a gas fireplace and a Carrara marble master-suite bathroom. The unit is on the market for $699,000."
## [2] "Florissant Police Chief William Karabas urged the council to publish a notice of intent to enter into a PLA. Karabas is chairman of the Emergency Communications Commission."
## [3] "The driving event features dressage and cones on Oct. 8 and the marathon Oct. 9. There is no admission or parking fee."
## [4] "Bad: Robinson loves Barton's defense and Nelson's offense, but the performance of those third guards (0 for 7 in 34 minutes) raises the question of whether the Beavers have enough guard depth right now to pull off the strategy."
## [5] "Published: April 15, 2000"
Basic characteristics of these data sets - number of lines, words and symbols, and size in megabytes - are provided in the table below:
| File | Lines | Words | Size (MB) | Symbols |
|---|---|---|---|---|
| Twitter | 2,360,148 | 30,373,543 | 318.99 | 162,096,031 |
| Blogs | 899,288 | 37,334,131 | 255.35 | 206,824,505 |
| News | 1,010,242 | 34,372,530 | 257.34 | 203,223,159 |
The selected training sample will be tokenized (i.e., split) into sentences and words. English stopwords and profanities will also be removed, to get rid of inappropriate words and of function words that contribute little to the model.
In addition to individual words, we will analyze \(n\)-grams - contiguous sequences of \(n\) words from the selected text subsamples. In this report, \(n\) will be 2, 3 and 4, and the corresponding sequences are also called bigrams, trigrams and quadrigrams, respectively.
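A minimal sketch of these tokenization and \(n\)-gram steps is shown below, assuming the quanteda package; the tiny `sample_lines` vector and the `profanity_list` object are placeholders, not the actual training sample or profanity list.

```r
# Sketch of tokenization, stopword/profanity removal and n-gram creation.
library(quanteda)

sample_lines <- c("Mary had a little lamb.",
                  "So excited for all the progress!")

corp <- corpus(sample_lines)
# split into sentences first, so n-grams do not cross sentence boundaries
corp_sent <- corpus_reshape(corp, to = "sentences")

toks <- tokens(corp_sent,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)

profanity_list <- c("badword1", "badword2")   # placeholder profanity list
toks <- tokens_remove(toks, pattern = c(stopwords("en"), profanity_list))

# contiguous word sequences of length 2, 3 and 4
bigrams   <- tokens_ngrams(toks, n = 2, concatenator = " ")
trigrams  <- tokens_ngrams(toks, n = 3, concatenator = " ")
quadgrams <- tokens_ngrams(toks, n = 4, concatenator = " ")
```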
For the analysis, a document-feature matrix will be created from the selected corpus subsets. It describes the frequency of terms (features) in a collection of documents. For this matrix, we will report summary data and calculate sparsity, which is the proportion of cells with zero counts.
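Continuing the sketch above (still assuming quanteda), the matrix and its summary values could be obtained roughly as follows:

```r
# Sketch: build document-feature matrices and compute summary values.
dfm_words   <- dfm(toks)        # single words
dfm_bigrams <- dfm(bigrams)     # 2-grams

sparsity(dfm_words)             # proportion of cells that are zero
topfeatures(dfm_words, 20)      # 20 most frequent features
nfeat(dfm_words)                # number of unique tokens (features)
sum(colSums(dfm_words))         # total token count
```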
For each type of source and each \(n\) value, the following items will be reported:

- the number of unique tokens,
- the total token count,
- the percentage of unique tokens needed to cover 50% and 90% of all token instances,
- the percentage of non-English tokens.
This summary is given below. As we can see, the number of unique single words is much smaller than for \(n\)-grams with \(n>1\), while the total number of tokens decreases as \(n\) increases. Less than 1% of single words are needed to cover 50% of all single-word instances, and 3-7% are needed to cover 90% of them. 7-13% of bigrams are enough to cover 50% of bigram instances, and about 80% are necessary to cover 90% of them. For higher \(n\) values, the percentage of features needed to reach a given coverage threshold is almost equal to that threshold (a sketch of this coverage computation is given after the table). Non-English texts make up about 1% of the sample:
| File | n-gram length | Unique tokens | Total tokens | % unique for 50% coverage | % unique for 90% coverage | Non-English (%) |
|---|---|---|---|---|---|---|
| twitter | 1 | 334461 | 13383952 | 0.16 | 3.70 | 1.12 |
| twitter | 2 | 4800804 | 10459482 | 7.73 | 78.20 | 1.39 |
| twitter | 3 | 6979666 | 7921334 | 43.25 | 88.65 | 1.37 |
| twitter | 4 | 5674575 | 5885767 | 48.13 | 89.62 | 0.97 |
| blogs | 1 | 268116 | 15143342 | 0.36 | 5.53 | 1.15 |
| blogs | 2 | 6977089 | 13286612 | 12.81 | 80.95 | 1.43 |
| blogs | 3 | 10836618 | 11544175 | 46.73 | 89.34 | 1.22 |
| blogs | 4 | 9852073 | 9958155 | 49.46 | 89.89 | 0.83 |
| news | 1 | 225148 | 15651113 | 0.51 | 6.51 | 1.34 |
| news | 2 | 6929962 | 14074563 | 11.08 | 79.68 | 1.71 |
| news | 3 | 11170260 | 12551002 | 43.81 | 88.76 | 1.35 |
| news | 4 | 10689294 | 11115454 | 48.00 | 89.60 | 0.82 |
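The coverage columns in the table above can be derived from feature frequencies. Below is a sketch of one way to do this, assuming a quanteda document-feature matrix such as the hypothetical `dfm_words` object from the earlier sketch:

```r
# Sketch: percentage of unique features needed to cover a given share
# of all token instances.
coverage_threshold <- function(dfm_obj, share = 0.5) {
  freqs <- sort(colSums(dfm_obj), decreasing = TRUE)  # feature counts, descending
  covered <- cumsum(freqs) / sum(freqs)               # cumulative token coverage
  n_needed <- which(covered >= share)[1]              # features needed for the share
  100 * n_needed / length(freqs)                      # as a percentage of unique features
}

coverage_threshold(dfm_words, 0.5)   # % of unique words covering 50% of instances
coverage_threshold(dfm_words, 0.9)   # % of unique words covering 90% of instances
```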
The top 20 words/\(n\)-grams for each text source and \(n\) value are presented below. For all sources, the larger \(n\), the lower the feature counts. As we can see, the most frequent entries are related to famous people (“president barack obama”, “martin luther king jr”, “hunter matt hunter matt”), celebrations (“happy birthday”, “happy mother’s day”, “happy cinco de mayo”), food (“cake cake cake cake”, “g fat g saturated”), people in authority positions (“county prosecutor’s office”, “superior court judge”, “chief executive officer”), and time (“last year”, “last week”, “two years ago”):
Frequencies of unique \(n\)-grams/words by text source and \(n\) are given in lollipop plots below. As we can see, most distinct \(n\)-grams have a frequency of 1, and this proportion increases with \(n\):
Total frequencies of \(n\)-grams/words by text source and \(n\) are presented below. Similar to the counts for unique \(n\)-grams, the proportion of \(n\)-grams with a frequency of 1 increases with \(n\). However, for lower \(n\) values, high-frequency \(n\)-grams account for a much larger share of total tokens (compared to the distinct counts above), as might be expected (a sketch of this computation is given after the sparsity note below):
For all data sources and \(n\) values, document-feature matrix sparsity exceeds 99.99%.
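A sketch of how these proportions and the sparsity figure can be computed is given below, again assuming a quanteda document-feature matrix such as the hypothetical `dfm_bigrams` object from the earlier sketch:

```r
# Sketch: share of distinct n-grams that occur only once, and the share
# of all token instances they account for (illustrated on the bigram dfm).
freqs <- colSums(dfm_bigrams)

mean(freqs == 1)                     # proportion of distinct bigrams seen once
sum(freqs[freqs == 1]) / sum(freqs)  # proportion of total bigram instances they cover

sparsity(dfm_bigrams)                # proportion of zero cells in the matrix
```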
Since the frequencies of words and \(n\)-grams can be expressed as proportions, it is natural to estimate the probability of the next word, given the beginning of a phrase, as the ratio between the count of that particular continuation and the total count for that beginning. However, we want to simplify this assessment by using only one or a few preceding words instead of the whole sentence. This is why \(n\)-grams will be used in the word prediction model.
The \(n\)-grams generated above will be split into an \((n-1)\)-gram containing the first \(n-1\) words and the final word. Our task is to predict the next word for a given phrase, so the last \(n-1\) words of the phrase will be processed as the input. For this \((n-1)\)-gram, we will find the words that occur most frequently in the corresponding \(n\)-grams.
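A rough sketch of this split and the resulting conditional frequencies is shown below; it reuses the hypothetical `dfm_bigrams` object and base R string operations, and is not the final implementation:

```r
# Sketch: split observed n-grams into an (n-1)-word prefix and a final
# word, then estimate P(word | prefix) as a ratio of counts.
ngram_counts <- colSums(dfm_bigrams)   # named counts, e.g. "little lamb" = 1

ngram_table <- data.frame(
  prefix = sub("\\s+\\S+$", "", names(ngram_counts)),  # all but the last word
  word   = sub("^.*\\s+",  "", names(ngram_counts)),   # the last word
  count  = as.numeric(ngram_counts)
)

# conditional probability of each final word given its prefix
ngram_table$prob <- ave(ngram_table$count, ngram_table$prefix,
                        FUN = function(x) x / sum(x))

# most likely continuations of a hypothetical prefix
head(ngram_table[ngram_table$prefix == "little", c("word", "prob")])
```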
In a bigram model (the last observed word plus the predicted word), this count-based estimate corresponds to the assumption that the next word depends only on the immediately preceding one:
\[ P(X_{n}=x \mid X_{n-1}=x_{n-1}, X_{n-2}=x_{n-2}, \ldots , X_1=x_1) = P(X_{n}=x \mid X_{n-1}=x_{n-1}) \]
For higher-order \(n\)-grams, however, the predicted word depends on the last several words. In this case, a Markov chain with memory (of order \(m\)) can be applied:
\[ P(X_{n}=x \mid X_{n-1}=x_{n-1}, X_{n-2}=x_{n-2}, \ldots , X_1=x_1) = P(X_{n}=x \mid X_{n-1}=x_{n-1}, X_{n-2} = x_{n-2}, \ldots , X_{n-m}=x_{n-m}), \quad n>m \]
Of course, it may happen that a given \((n-1)\)-gram is not among those created from the training set. In that case, we will use back-off models to allocate some probability mass to unseen word combinations. For example, we could use the last two words to predict the next word if we have examples of the particular trigram. If we don’t, we back off to the corresponding bigram. If there are no counts to compute the next-word probability using the last word, we estimate unconditional word probabilities. In other words, we only “back off” to a lower-order \(n\)-gram if we have no examples of the higher-order \(n\)-gram.
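The back-off lookup described above could look roughly like the sketch below (without discounting yet); `trigram_table`, `bigram_table` and `unigram_probs` are assumed objects following the prefix/word/probability layout from the earlier sketch:

```r
# Sketch: try the trigram table first, then the bigram table, then
# unconditional word frequencies.
predict_next <- function(phrase, trigram_table, bigram_table, unigram_probs) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)  # last two words

  hits <- trigram_table[trigram_table$prefix == paste(words, collapse = " "), ]
  if (nrow(hits) == 0) {
    hits <- bigram_table[bigram_table$prefix == tail(words, 1), ]
  }
  if (nrow(hits) == 0) {
    # no matching n-gram: fall back to the most frequent word overall
    return(names(sort(unigram_probs, decreasing = TRUE))[1])
  }
  hits$word[which.max(hits$prob)]
}
```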
In order for a back-off model to keep a proper probability distribution (so that the total probability does not exceed 1), we have to decrease the probabilities of the higher-order \(n\)-grams to reserve some probability mass for the lower-order \(n\)-grams. This is called smoothing or discounting. Often, a function (denoted e.g. \(\alpha\)) is used to distribute this probability mass.
Katz back-off is a back-off method combined with discounting. If the \(n\)-gram has a non-zero count, the next-word probability is a discounted conditional probability given the preceding words. Otherwise, we back off to a lower-order \(n\)-gram:
\[ P_{BO}(w_n \mid w_{n-N+1}^{n-1})= \begin{cases} P^*(w_n \mid w_{n-N+1}^{n-1}), & \text{if $C(w^n_{n-N+1})>0$}.\\ \alpha(w^{n-1}_{n-N+1})P_{BO}(w_n \mid w^{n-1}_{n-N+2}), & \text{otherwise}. \end{cases} \] Katz back-off is often combined with Good-Turing smoothing. However, empirical examples show that absolute discounting can also give good estimates of \(n\)-gram probabilities, so the exact way to implement the algorithm is still to be decided.
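Since the exact smoothing method is still to be decided, the sketch below only illustrates absolute discounting for a single prefix, showing how the discount frees the probability mass (the \(\alpha\) weight) passed to the lower-order model; all counts and the discount value are made up for illustration:

```r
# Sketch of absolute discounting for one prefix: subtract a fixed
# discount d from each observed count and collect the freed mass as the
# back-off weight alpha for that prefix.
counts <- c(lamb = 5, dog = 2, house = 1)   # counts of words following a prefix
d <- 0.75                                   # discount, to be tuned

p_discounted <- pmax(counts - d, 0) / sum(counts)  # P*(w | prefix)
alpha <- 1 - sum(p_discounted)                     # mass left for the lower-order model

p_discounted
alpha
```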
The final algorithm will be implemented as a Shiny app that takes the entered phrase as input and returns the suggested word as output. The following challenges can be expected: