1. Summary of Corpus Files

The text datasets come from three sources: blogs, news articles, and Twitter. Each line in each dataset represents a distinct entry (e.g. a sentence, a paragraph, or a tweet). Table 1 presents descriptive statistics for each file, and Figure 1 shows the same summary statistics visually. Two observations stand out: (1) the Twitter dataset is the smallest by character and word count, yet has by far the most lines; (2) the news dataset has the longest mean word length, while the Twitter dataset has the shortest.

Table 1: Summary Statistics Calculated for Each US English Corpus File
| Document | Character Count | Word Count | Line Count | Mean Line Length (Characters) | Mean Line Length (Words) | Mean Word Length |
| blogs    | 206,824,505     | 37,570,839 | 899,288    | 229.98695                     | 41.77843                 | 5.504921         |
| news     | 203,223,159     | 34,494,539 | 1,010,242  | 201.16285                     | 34.14483                 | 5.891459         |
| twitter  | 162,096,031     | 30,451,128 | 2,360,148  | 68.68045                      | 12.90221                 | 5.323154         |

2. Data Processing and Exploratory Data Analysis

Because of resource constraints, the exploratory analysis uses random 1% samples of each corpus file. The samples are combined into a single corpus, which is then split into tokens (words), with punctuation, numbers, URLs, and profanities removed. All tokens are converted to lowercase, and any token that appears fewer than five times in the corpus is discarded. From this tokenized corpus, two further tokenized versions are derived by grouping tokens into 2-grams (overlapping word pairs) and 3-grams (overlapping word triplets). For each tokenized dataset, two frequency tables listing each unique token and its count are created: one including and one excluding stopwords.
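
To illustrate what the 2-gram and 3-gram grouping produces, the short sketch below applies the same quanteda functions used in Appendix A to a single made-up sentence; the sentence and object names are purely illustrative.

library(quanteda)
# a one-sentence example, tokenized and lowercased as in the main analysis
toy_tokens <- tokens("The quick brown fox jumps over the lazy dog",
                     remove_punct = TRUE)
toy_tokens <- tokens_tolower(toy_tokens)
# drop English stopwords ("the", "over") to mirror the no-stopwords variant
toy_tokens_ns <- tokens_select(toy_tokens, stopwords("english"),
                               selection = "remove")
# overlapping word pairs and triplets
tokens_ngrams(toy_tokens_ns, n = 2)
# "quick_brown" "brown_fox" "fox_jumps" "jumps_lazy" "lazy_dog"
tokens_ngrams(toy_tokens_ns, n = 3)
# "quick_brown_fox" "brown_fox_jumps" "fox_jumps_lazy" "jumps_lazy_dog"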

Table 2 illustrates the effect of removing stopwords. Although stopwords account for only a small number of unique tokens, removing them reduces the total token count considerably: stopwords occur very frequently but carry little meaning on their own.

Table 2: Total and Unique Token Counts
| Stopwords Included | Total Tokens | Unique Tokens |
| Yes                | 931,129      | 12,702        |
| No                 | 481,677      | 12,532        |

3. N-Gram Token Frequency (Stopwords Removed)

Figure 2 shows a wordcloud generated from the 1-gram no-stopwords frequency matrix. Figures 3, 4 and 5 display the top 30 no-stopwords 1-, 2- and 3-grams by frequency, respectively.

Figure 2: Wordcloud (Stopwords Removed)

4. N-Gram Token Frequency (Stopwords Included)

Figure 6 shows a wordcloud generated from the 1-gram stopword-inclusive frequency matrix. Figures 7, 8 and 9 display the top 30 stopword-inclusive 1-, 2- and 3-grams by frequency, respectively. It can be seen from these figures that stopwords are indeed used very frequently.

Figure 6: Wordcloud (Stopwords Included)

5. Relationship Between Unique Features and Corpus Coverage Levels

Table 3 lists the proportion of unique tokens needed to cover 50%, 90%, 98% and 99% of the corpus with stopwords removed. Figure 10 shows the relationship between the coverage of the no-stopwords corpus and the proportion of unique tokens needed.

Table 3: Proportions of Unique Features Required for Certain Coverage Levels of the Corpus (Stopwords Removed)
| Proportion of Corpus Covered | Proportion of Unique Features Required |
| 0.50                         | 0.0573731                              |
| 0.90                         | 0.4838813                              |
| 0.98                         | 0.8503032                              |
| 0.99                         | 0.9231567                              |

Table 4 lists the proportion of unique tokens needed to cover 50%, 90%, 98% and 99% of the corpus with stopwords included. Figure 11 shows the relationship between the coverage of the stopwords-inclusive corpus and the proportion of unique tokens needed. Compared to the no-stopwords corpus, fewer unique features are required to meet each coverage level, due to the very high frequency of many stopwords.

Table 4: Proportions of Unique Features Required for Certain Coverage Levels of the Corpus (Stopwords Included)
| Proportion of Corpus Covered | Proportion of Unique Features Required |
| 0.50                         | 0.0083451                              |
| 0.90                         | 0.2902692                              |
| 0.98                         | 0.7410644                              |
| 0.99                         | 0.8565580                              |
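
For reference, a coverage curve like those shown in Figures 10 and 11 can also be computed directly from a frequency table sorted in decreasing order, using a cumulative sum instead of counting features one at a time. The sketch below illustrates the calculation on a small hypothetical frequency vector (the numbers are made up); the appendix uses an equivalent loop-based function over the real corpus frequencies.

# hypothetical frequencies of five features, sorted in decreasing order
freqs <- c(50, 25, 15, 7, 3)
# proportion of the corpus covered by the top k features, for each k
cum_coverage <- cumsum(freqs) / sum(freqs)
# smallest number of features whose combined frequency reaches each target level
targets <- c(0.50, 0.90, 0.98, 0.99)
features_required <- sapply(targets, function(p) which(cum_coverage >= p)[1])
# express as a proportion of all unique features
features_required / length(freqs)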

6. Next Steps: Predictive Modelling

The final goal of this analysis is to build a model that accurately predicts the next word from the preceding words while minimising resource usage, since the final product must run as a web app. Based on research and prior experience, three methods are being considered for the modelling stage. Stupid Backoff is the simplest, and with sufficient data it is known to approach the accuracy of more sophisticated smoothing methods. Kneser-Ney Smoothing is a related but more complex method that may yield higher accuracy. A Recurrent Neural Network (RNN)-based language model is a more advanced option that could plausibly outperform both, but may also require considerably more resources. Experimentation will determine which method, or combination of methods, best meets these goals.
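
To make the Stupid Backoff option concrete, the sketch below scores candidate next words for the context "a great" using small hypothetical n-gram count tables. The counts, the context, and the score_candidate helper are illustrative assumptions only; what follows the published method is the scoring rule itself: the observed relative frequency of the highest-order n-gram, backing off to lower orders with a fixed factor of 0.4 when a count is missing.

# hypothetical n-gram counts (illustrative only), stored as named numeric vectors
unigram_counts <- c("great" = 200, "day" = 120, "time" = 95, "weekend" = 30)
bigram_counts  <- c("a_great" = 40, "great_day" = 12, "great_time" = 18)
trigram_counts <- c("a_great_day" = 5, "a_great_time" = 2)
total_unigrams <- sum(unigram_counts)
# Stupid Backoff score of a candidate word given the context "a great":
# relative trigram frequency if the trigram was seen, otherwise back off to
# lambda * bigram score, then lambda^2 * unigram score
score_candidate <- function(word, lambda = 0.4) {
  tri <- paste("a", "great", word, sep = "_")
  bi  <- paste("great", word, sep = "_")
  if (!is.na(trigram_counts[tri])) {
    unname(trigram_counts[tri] / bigram_counts["a_great"])
  } else if (!is.na(bigram_counts[bi])) {
    lambda * unname(bigram_counts[bi] / unigram_counts["great"])
  } else {
    lambda^2 * unname(unigram_counts[word] / total_unigrams)
  }
}
# rank some candidate next words by their back-off scores
candidates <- c("day", "time", "weekend")
sort(sapply(candidates, score_candidate), decreasing = TRUE)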

Appendix A: Source Code Listings

1. Summary of Corpus Files

# define path and generate full filenames
setwd("~/Dropbox/@Next/AI/JH_Capstone")
path <- "./data/final/en_US/"
docs <- c("blogs", "news", "twitter")
filenames <- sapply(docs, (function(f) { paste0(path, "en_US.", f, ".txt") }))
# read lines of each file
data <- sapply(filenames, readLines)
library(stringi)
# calculate statistics for each file
char_counts <- sapply(data, (function(c) { sum(nchar(c)) }))
word_counts <- sapply(data, (function(c) { sum(stri_stats_latex(c)[["Words"]]) }))
line_counts <- sapply(data, length)
mean_line_chars <- char_counts / line_counts
mean_line_words <- word_counts / line_counts
mean_word_length <- char_counts / word_counts
# generate data frame
file_stats <- data.frame(document = docs,
                         char.count = char_counts, 
                         word.count = word_counts, 
                         line.count = line_counts,
                         mean.line.chars = mean_line_chars,
                         mean.line.words = mean_line_words, 
                         mean.word.length = mean_word_length,
                         row.names = NULL)
library(knitr)
# generate column names for table
file_stats_table_col_names <- c("Document",
                                "Character Count",
                               "Word Count",
                               "Line Count",
                               "Mean Line Length (Characters)",
                               "Mean Line Length (Words)",
                               "Mean Word Length")
# generate table
kable(file_stats, 
      caption = "Table 1: Summary Statistics Calculated for Each US English Corpus File", 
      format.args = list(big.mark=","), 
      col.names = file_stats_table_col_names)
library(ggplot2)
library(grid)
library(gridExtra)
# barplot of character counts
char_count_plot <- ggplot(file_stats) + 
  aes(x = document, y = char.count, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Character Counts") +
  theme(legend.position="none")
# barplot of word counts
word_count_plot <- ggplot(file_stats) + 
  aes(x = document, y = word.count, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Word Counts") +
  theme(legend.position="none")
# barplot of line counts
line_count_plot <- ggplot(file_stats) + 
  aes(x = document, y = line.count, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Line Counts") +
  theme(legend.position="none")
# barplot of mean line lengths (characters)
mean_line_chars_plot <- ggplot(file_stats) + 
  aes(x = document, y = mean.line.chars, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Mean Line Lengths\n(Characters)") +
  theme(legend.position="none")
# barplot of mean line lengths (words)
mean_line_words_plot <- ggplot(file_stats) + 
  aes(x = document, y = mean.line.words, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Mean Line Lengths\n(Words)") +
  theme(legend.position="none")
# barplot of mean word lengths
mean_word_length_plot <- ggplot(file_stats) + 
  aes(x = document, y = mean.word.length, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Mean Word Lengths") +
  theme(legend.position="none")
# arrange plots into a grid
grid.arrange(char_count_plot, 
             word_count_plot, 
             line_count_plot, 
             mean_line_chars_plot, 
             mean_line_words_plot, 
             mean_word_length_plot, 
             ncol = 3, 
             bottom="Figure 1: Plots of Summary Statistics Calculated for Each US English Corpus File")

2. Data Processing and Exploratory Data Analysis

# fix the random seed so the 1% sample drawn below is reproducible
set.seed(20190101)
path <- "~/Dropbox/@Next/AI/JH_Capstone/data/"
for (doc in docs) {
  # get filename
  filename <- paste0(path, "final/en_US/en_US.", doc, ".txt")
  sample_filename <- paste0(path, "sample_", doc, ".txt")
  # read data
  data <- readLines(filename)
  # generate logical vector marking a random 1% of lines to keep
  keep <- as.logical(rbinom(length(data), 1, 0.01))
  # subset using this vector
  sample <- data[keep]
  # write the sample to a new file
  writeLines(sample, sample_filename)
}
library(readtext)
library(quanteda)
library(dplyr)
# load sample data
path <- "~/Dropbox/@Next/AI/JH_Capstone/"
sample_data <- readtext(paste(path, "data/*.txt", sep=""))
# create corpus
sample_corpus <- corpus(sample_data)
# load profanity dictionary
profanity <- read.csv(paste(path, "profanity.csv", sep=""), sep=";", header = FALSE, stringsAsFactors = FALSE)$V1
# generate tokens object
sample_tokens <- sample_corpus %>% 
  tokens(remove_punct=TRUE, remove_numbers=TRUE, remove_twitter=TRUE, remove_url=TRUE) %>%
  tokens_select(profanity, selection='remove') %>%
  tokens_tolower()
# generate dfm object
sample_dfm <- dfm(sample_tokens)
# generate list of features occurring 4 or fewer times
features_le4 <- textstat_frequency(sample_dfm) %>% filter(frequency <= 4) %>% .$feature
# remove these features from tokens and dfm objects
sample_tokens <- sample_tokens %>% tokens_select(features_le4, selection='remove')
sample_dfm <- dfm(sample_tokens)
# generate 2- and 3-gram tokens objects
sample_tokens_2gram <- tokens_ngrams(sample_tokens, 2)
sample_tokens_3gram <- tokens_ngrams(sample_tokens, 3)
# generate 2- and 3-gram dfm objects
sample_dfm_2gram <- dfm(sample_tokens_2gram)
sample_dfm_3gram <- dfm(sample_tokens_3gram)
# generate feature frequency matrices for 1-, 2- and 3-grams
freq_df <- textstat_frequency(sample_dfm) %>% mutate(feature = as.factor(feature))
freq_df_2gram <- textstat_frequency(sample_dfm_2gram) %>% mutate(feature = as.factor(feature))
freq_df_3gram <- textstat_frequency(sample_dfm_3gram) %>% mutate(feature = as.factor(feature))
# create token and dfm objects as above but with stopwords removed
sample_tokens_ns <- sample_tokens %>% tokens_select(stopwords("english"), selection = 'remove')
sample_dfm_ns <- dfm(sample_tokens_ns)
sample_tokens_2gram_ns <- tokens_ngrams(sample_tokens_ns, 2)
sample_tokens_3gram_ns <- tokens_ngrams(sample_tokens_ns, 3)
sample_dfm_2gram_ns <- dfm(sample_tokens_2gram_ns)
sample_dfm_3gram_ns <- dfm(sample_tokens_3gram_ns)
freq_df_ns <- textstat_frequency(sample_dfm_ns) %>% mutate(feature = as.factor(feature))
freq_df_2gram_ns <- textstat_frequency(sample_dfm_2gram_ns) %>% mutate(feature = as.factor(feature))
freq_df_3gram_ns <- textstat_frequency(sample_dfm_3gram_ns) %>% mutate(feature = as.factor(feature))
# calculate total tokens and features w/ and w/o stopwords
total_tokens <- sum(ntoken(sample_dfm))
total_features <- nfeat(sample_dfm)
total_tokens_ns <- sum(ntoken(sample_dfm_ns))
total_features_ns <- nfeat(sample_dfm_ns)
# generate and display table of total tokens and features w/ and w/o stopwords
fcount <- data.frame(stopwords.included = c("Yes", "No"), 
                     totals = c(total_tokens, total_tokens_ns), 
                     unique = c(total_features, total_features_ns))
kable(fcount, 
      col.names = c("Stopwords Included", "Total Tokens", "Unique Tokens"),
      format.args = list(big.mark=","),
      caption = "Table 2: Total and Unique Token Counts")

3. N-Gram Token Frequency (Stopwords Removed)

textplot_wordcloud(sample_dfm_ns, max_words = 100)
# get top 30 1-, 2- and 3-grams by frequency (no stopwords)
freq_top30_ns <- freq_df_ns %>% slice(1:30)
freq_2gram_top30_ns <- freq_df_2gram_ns %>% slice(1:30)
freq_3gram_top30_ns <- freq_df_3gram_ns %>% slice(1:30)
# plot top 30 1-grams
ggplot(freq_top30_ns) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 3: Top 30 1-Grams by Frequency (Stopwords Removed)") +
  xlab("feature (1-gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 2-grams
ggplot(freq_2gram_top30_ns) + 
  aes(x = reorder(feature, frequency), y = frequency) +
  ggtitle("Figure 4: Top 30 2-Grams by Frequency (Stopwords Removed)") +
  xlab("feature (2-gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 3-grams
ggplot(freq_3gram_top30_ns) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 5: Top 30 3-Grams by Frequency (Stopwords Removed)") +
  xlab("feature (3-gram)") +
  geom_col(fill='darkred') + 
  coord_flip()

4. N-Gram Token Frequency (Stopwords Included)

textplot_wordcloud(sample_dfm, max_words = 100, min_size=1)
# get top 30 1-, 2- and 3-grams by frequency (stopwords included)
freq_top30 <- freq_df %>% slice(1:30)
freq_2gram_top30 <- freq_df_2gram %>% slice(1:30)
freq_3gram_top30 <- freq_df_3gram %>% slice(1:30)
# plot top 30 1-grams
ggplot(freq_top30) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 7: Top 30 1-grams by Frequency (Stopwords Included)") +
  xlab("feature (1-Gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 2-grams
ggplot(freq_2gram_top30) + 
  aes(x = reorder(feature, frequency), y = frequency) +
  ggtitle("Figure 8: Top 30 2-Grams by Frequency (Stopwords Included)") +
  xlab("Feature (2-gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 3-grams
ggplot(freq_3gram_top30) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 9: Top 30 3-Grams by Frequency (Stopwords Included)") +
  xlab("feature (3-gram)") +
  geom_col(fill='darkred') + 
  coord_flip()

5. Relationship Between Unique Features and Corpus Coverage Levels

# calculate minimum no. of unique features required to meet required corpus coverage level
features_needed <- function(required_coverage, freq, total)
{
  current_feature <- 0
  current_tokens <- 0
  
  # while tokens counted is less than required amount
  while (current_tokens < total * required_coverage) {
    # increment feature counter/index
    current_feature <- current_feature + 1
    # increment count of tokens covered based on frequency of current feature
    current_tokens <- current_tokens + freq[current_feature, ]$frequency
  }
  
  # return no. of features needed
  return(current_feature)
}
# generate vector of coverage proportions in 0.01 increments
coverage <- seq(0, 1, 0.01)
# calculate required features
features_proportion <- sapply(coverage, function(n) { features_needed(n, freq_df_ns, total_tokens_ns) }) / total_features_ns
# generate data frame from these
coverage_table <- data.frame(coverage, features_proportion)
# display table for 50, 90, 98, 99%
kable(coverage_table[c(51, 91, 99, 100),],
      col.names = c("Proportion of Corpus Covered", "Proportion of Unique Features Required"),
      caption = "Table 3: Proportions of Unique Featured Required for Certain Coverage Levels of the Corpus (Stopwords Removed)",
      row.names = FALSE)
# plot features needed against coverage level
ggplot(coverage_table) + 
  aes(x = coverage, y = features_proportion) + 
  geom_line() +
  xlab("Proportion of Corpus Covered") +
  ylab("Proportion of Unique Features Required") +
  ggtitle("Figure 10: Graph of Proportion of Unique Features Required\nAgainst Corpus Coverage Level (Stopwords Removed)")
# same as previously, but with stopwords included
coverage <- seq(0, 1, 0.01)
features_proportion <- sapply(coverage, function(n) { features_needed(n, freq_df, total_tokens) }) / total_features
coverage_table <- data.frame(coverage, features_proportion)
kable(coverage_table[c(51, 91, 99, 100),],
      col.names = c("Proportion of Corpus Covered", "Proportion of Unique Features Required"),
      caption = "Table 4: Proportions of Unique Featured Required for Certain Coverage Levels of the Corpus (Stopwords Included)",
      row.names = FALSE)
ggplot(coverage_table) + 
  aes(x = coverage, y = features_proportion) + 
  geom_line() +
  xlab("Proportion of Corpus Covered") +
  ylab("Proportion of Unique Features Required") +
  ggtitle("Figure 11: Graph of Proportion of Unique Features Required\nAgainst Corpus Coverage Level (Stopwords Included)")