The given dataset comprises three English-language files, each containing text from blogs, news sources or Twitter. The documentation provided with the dataset states that each line of each file represents a separate entry (e.g. a sentence, paragraph or Tweet). Table 1 provides descriptive statistics for each of these files, while Figure 1 presents the same summary statistics visually. The Twitter file is the smallest in terms of characters and words but contains by far the most lines, which is consistent with its much shorter average line length. Additionally, the news file has the longest average word length, while the Twitter file has the shortest.
| Document | Character Count | Word Count | Line Count | Mean Line Length (Characters) | Mean Line Length (Words) | Mean Word Length |
|---|---|---|---|---|---|---|
| blogs | 206,824,505 | 37,570,839 | 899,288 | 229.98695 | 41.77843 | 5.504921 |
| news | 203,223,159 | 34,494,539 | 1,010,242 | 201.16285 | 34.14483 | 5.891459 |
| twitter | 162,096,031 | 30,451,128 | 2,360,148 | 68.68045 | 12.90221 | 5.323154 |
Following these basic file statistics, a more in-depth set of exploratory analyses was performed. Due to resource constraints, further analyses were performed on a random 1% sample of the corpus, drawn line by line from each file. Once sampled, the data were processed as follows. The sample documents were combined into a single corpus, which was split into tokens (words) while removing punctuation, numbers and URLs. Profanity was filtered out, all tokens were converted to lowercase, and any token appearing fewer than five times in the corpus was removed. The tokenized corpus was then used to create two further tokenized versions, grouping tokens into 2-grams (overlapping word pairs) and 3-grams (overlapping word triplets). For each set of tokenized data, two frequency tables listing each unique token and its count were created: one with so-called stopwords included and one without.
One basic impact of removing stopwords can be seen in Table 2. Although the stopword list accounts for only 171 of the 12,698 unique tokens, removing it roughly halved the total token count. Clearly, stopwords are used extremely frequently, despite contributing, by definition, relatively little meaning of their own.
| Stopwords Included | Total Tokens | Unique Tokens |
|---|---|---|
| Yes | 937,124 | 12,698 |
| No | 483,195 | 12,527 |
Figure 2 shows a wordcloud generated from the 1-gram frequency table with stopwords removed. Figures 3, 4 and 5 display the 30 most frequent 1-, 2- and 3-grams, respectively, with stopwords removed.
Figure 2: Wordcloud (Stopwords Removed)
Figure 6 shows a wordcloud generated from the 1-gram frequency table with stopwords included. Figures 7, 8 and 9 display the 30 most frequent 1-, 2- and 3-grams, respectively, with stopwords included. These figures confirm just how frequently stopwords are used.
Figure 6: Wordcloud (Stopwords Included)
Table 3 lists the proportion of unique tokens (taken in decreasing order of frequency) needed to cover 50%, 90%, 98% and 99% of all token instances in the corpus with stopwords removed, while Figure 10 graphs the relationship between the coverage level and the proportion of unique tokens needed.
| Proportion of Corpus Covered | Proportion of Unique Features Required |
|---|---|
| 0.50 | 0.0567574 |
| 0.90 | 0.4841542 |
| 0.98 | 0.8508023 |
| 0.99 | 0.9228866 |
Table 4 lists the proportion of unique tokens needed to cover 50%, 90%, 98% and 99% of the corpus with stopwords included, while Figure 11 graphs the relationship between the coverage of the stopwords-inclusive corpus and the proportion of unique tokens needed. Compared to the no-stopwords corpus, fewer unique features are required to meet each coverage level, due to the very high frequency of many stopwords.
| Proportion of Corpus Covered | Proportion of Unique Features Required |
|---|---|
| 0.50 | 0.0082690 |
| 0.90 | 0.2881556 |
| 0.98 | 0.7405891 |
| 0.99 | 0.8565916 |
The final goal of this analysis is to build a model which can accurately predict the next word from the preceding words, while minimising resource usage, as the final product must run as a web app. Based on research and prior experience, three methods are being considered for the modelling stage. Stupid Backoff is the simplest, and with enough data it is known to approach the accuracy of more sophisticated smoothing methods. Kneser-Ney smoothing is a similar but more complex method and may yield higher accuracy. A Recurrent Neural Network (RNN)-based language model is a more advanced option that would likely achieve higher accuracy than the others, but would also require considerably more resources. Experimentation will determine which of these methods, or which combination, best suits the goal in mind.
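As an illustration of the simplest of these options, the sketch below shows how Stupid Backoff scores could be computed directly from the 1-, 2- and 3-gram frequency tables built above (freq_df, freq_df_2gram and freq_df_3gram). The helper get_count(), the function stupid_backoff() and the backoff factor of 0.4 are illustrative assumptions rather than the final implementation; the sketch also assumes quanteda's default "_" separator for n-gram features.
# helper: look up the frequency of a feature in a frequency table (0 if absent)
get_count <- function(freqs, feat) {
    hit <- freqs$frequency[freqs$feature == feat]
    if (length(hit) == 0) 0 else hit[1]
}
# score a candidate next word w3 given the two preceding words w1 and w2
stupid_backoff <- function(w1, w2, w3, alpha = 0.4) {
    # use the trigram relative frequency if the trigram was observed
    tri <- get_count(freq_df_3gram, paste(w1, w2, w3, sep = "_"))
    big <- get_count(freq_df_2gram, paste(w1, w2, sep = "_"))
    if (tri > 0 && big > 0) return(tri / big)
    # otherwise back off to the bigram, discounted by alpha
    big2 <- get_count(freq_df_2gram, paste(w2, w3, sep = "_"))
    uni <- get_count(freq_df, w2)
    if (big2 > 0 && uni > 0) return(alpha * big2 / uni)
    # otherwise back off to the unigram relative frequency
    alpha^2 * get_count(freq_df, w3) / total_tokens
}
# e.g. rank candidate next words w by stupid_backoff("thanks", "for", w)
Because scores of this kind depend only on pre-computed counts, an approach along these lines keeps memory and computation low, which fits the web app constraint.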
# define path and generate full filenames
path <- "~/Projects/swiftkey-nlp/data/final/en_US/"
docs <- c("blogs", "news", "twitter")
filenames <- sapply(docs, (function(f) { paste0(path, "en_US.", f, ".txt") }))
# read lines of each file
data <- sapply(filenames, readLines)
library(stringi)
# calculate statistics for each file
char_counts <- sapply(data, (function(c) { sum(nchar(c)) }))
word_counts <- sapply(data, (function(c) { sum(stri_stats_latex(c)[["Words"]]) }))
line_counts <- sapply(data, length)
mean_line_chars <- char_counts / line_counts
mean_line_words <- word_counts / line_counts
mean_word_length <- char_counts / word_counts
# generate data frame
file_stats <- data.frame(document = docs,
char.count = char_counts,
word.count = word_counts,
line.count = line_counts,
mean.line.chars = mean_line_chars,
mean.line.words = mean_line_words,
mean.word.length = mean_word_length,
row.names = NULL)
library(knitr)
# generate column names for table
file_stats_table_col_names <- c("Document",
"Character Count",
"Word Count",
"Line Count",
"Mean Line Length (Characters)",
"Mean Line Length (Words)",
"Mean Word Length")
# generate table
kable(file_stats,
caption = "Table 1: Summary Statistics Calculated for Each US English Corpus File",
format.args = list(big.mark=","),
col.names = file_stats_table_col_names)
library(ggplot2)
library(grid)
library(gridExtra)
# barplot of character counts
char_count_plot <- ggplot(file_stats) +
aes(x = document, y = char.count, fill = document) +
geom_bar(stat = "identity") +
xlab("") +
ylab("") +
ggtitle("Character Counts") +
theme(legend.position="none")
# barplot of word counts
word_count_plot <- ggplot(file_stats) +
aes(x = document, y = word.count, fill = document) +
geom_bar(stat = "identity") +
xlab("") +
ylab("") +
ggtitle("Word Counts") +
theme(legend.position="none")
# barplot of line counts
line_count_plot <- ggplot(file_stats) +
aes(x = document, y = line.count, fill = document) +
geom_bar(stat = "identity") +
xlab("") +
ylab("") +
ggtitle("Line Counts") +
theme(legend.position="none")
# barplot of mean line lengths (characters)
mean_line_chars_plot <- ggplot(file_stats) +
aes(x = document, y = mean.line.chars, fill = document) +
geom_bar(stat = "identity") +
xlab("") +
ylab("") +
ggtitle("Mean Line Lengths\n(Characters)") +
theme(legend.position="none")
# barplot of mean line lengths (words)
mean_line_words_plot <- ggplot(file_stats) +
aes(x = document, y = mean.line.words, fill = document) +
geom_bar(stat = "identity") +
xlab("") +
ylab("") +
ggtitle("Mean Line Lengths\n(Words)") +
theme(legend.position="none")
# barplot of mean word lengths
mean_word_length_plot <- ggplot(file_stats) +
aes(x = document, y = mean.word.length, fill = document) +
geom_bar(stat = "identity") +
xlab("") +
ylab("") +
ggtitle("Mean Word Lengths") +
theme(legend.position="none")
# arrange plots into a grid
grid.arrange(char_count_plot,
word_count_plot,
line_count_plot,
mean_line_chars_plot,
mean_line_words_plot,
mean_word_length_plot,
ncol = 3,
bottom="Figure 1: Plots of Summary Statistics Calculated for Each US English Corpus File")
# set seed for reproducibility
set.seed(1234)
path <- "~/Projects/swiftkey-nlp/data/"
for (doc in docs) {
# get filename
filename <- paste0(path, "final/en_US/en_US.", doc, ".txt")
sample_filename <- paste0(path, "sample_", doc, ".txt")
# read data
data <- readLines(filename)
# generate logical vector of random 1% of lines to keep
keep <- as.logical(rbinom(length(data), 1, 0.01))
# subset using this vector
sample <- data[keep]
# write samples to new files
writeLines(sample, sample_filename)
}
library(readtext)
library(quanteda)
library(dplyr)
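# note: the textstat_* and textplot_* functions used below are part of the core
# quanteda package for versions < 3.0; if a newer quanteda is installed, the
# companion packages would also need to be loaded (left commented out here as
# an assumption about the installed version):
# library(quanteda.textstats)
# library(quanteda.textplots)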
# load sample data
path <- "~/Projects/swiftkey-nlp/"
sample_data <- readtext(paste0(path, "data/*.txt"))
# create corpus
sample_corpus <- corpus(sample_data)
# load profanity dictionary
profanity <- read.csv(paste(path, "profanity.csv", sep=""), sep=";", header = FALSE, stringsAsFactors = FALSE)$V1
# generate tokens object
sample_tokens <- sample_corpus %>%
tokens(remove_punct=TRUE, remove_numbers=TRUE, remove_twitter=TRUE, remove_url=TRUE) %>%
tokens_select(profanity, selection='remove') %>%
tokens_tolower()
# generate dfm object
sample_dfm <- dfm(sample_tokens)
# generate list of features occurring 4 or fewer times
features_le4 <- textstat_frequency(sample_dfm) %>% filter(frequency <= 4) %>% .$feature
# remove these features from tokens and dfm objects
sample_tokens <- sample_tokens %>% tokens_select(features_le4, selection='remove')
sample_dfm <- dfm(sample_tokens)
# generate 2- and 3-gram tokens objects
sample_tokens_2gram <- tokens_ngrams(sample_tokens, 2)
sample_tokens_3gram <- tokens_ngrams(sample_tokens, 3)
# generate 2- and 3-gram dfm objects
sample_dfm_2gram <- dfm(sample_tokens_2gram)
sample_dfm_3gram <- dfm(sample_tokens_3gram)
# generate feature frequency matrices for 1-, 2- and 3-grams
freq_df <- textstat_frequency(sample_dfm) %>% mutate(feature = as.factor(feature))
freq_df_2gram <- textstat_frequency(sample_dfm_2gram) %>% mutate(feature = as.factor(feature))
freq_df_3gram <- textstat_frequency(sample_dfm_3gram) %>% mutate(feature = as.factor(feature))
# create token and dfm objects as above but with stopwords removed
sample_tokens_ns <- sample_tokens %>% tokens_select(stopwords("english"), selection = 'remove')
sample_dfm_ns <- dfm(sample_tokens_ns)
sample_tokens_2gram_ns <- tokens_ngrams(sample_tokens_ns, 2)
sample_tokens_3gram_ns <- tokens_ngrams(sample_tokens_ns, 3)
sample_dfm_2gram_ns <- dfm(sample_tokens_2gram_ns)
sample_dfm_3gram_ns <- dfm(sample_tokens_3gram_ns)
freq_df_ns <- textstat_frequency(sample_dfm_ns) %>% mutate(feature = as.factor(feature))
freq_df_2gram_ns <- textstat_frequency(sample_dfm_2gram_ns) %>% mutate(feature = as.factor(feature))
freq_df_3gram_ns <- textstat_frequency(sample_dfm_3gram_ns) %>% mutate(feature = as.factor(feature))
# calculate total tokens and features w/ and w/o stopwords
total_tokens <- sum(ntoken(sample_dfm))
total_features <- nfeat(sample_dfm)
total_tokens_ns <- sum(ntoken(sample_dfm_ns))
total_features_ns <- nfeat(sample_dfm_ns)
# generate and display table of total tokens and features w/ and w/o stopwords
fcount <- data.frame(stopwords.included = c("Yes", "No"),
totals = c(total_tokens, total_tokens_ns),
unique = c(total_features, total_features_ns))
kable(fcount,
col.names = c("Stopwords Included", "Total Tokens", "Unique Tokens"),
format.args = list(big.mark=","),
caption = "Table 2: Total and Unique Token Counts")
textplot_wordcloud(sample_dfm_ns, max_words = 100)
# get top 30 1-, 2- and 3-grams by frequency (no stopwords)
freq_top30_ns <- freq_df_ns %>% slice(1:30)
freq_2gram_top30_ns <- freq_df_2gram_ns %>% slice(1:30)
freq_3gram_top30_ns <- freq_df_3gram_ns %>% slice(1:30)
# plot top 30 1-grams
ggplot(freq_top30_ns) +
aes(x = reorder(feature, frequency), y = frequency) +
ggtitle("Figure 3: Top 30 1-Grams by Frequency (Stopwords Removed)") +
xlab("feature (1-gram)") +
geom_col(fill='darkred') +
coord_flip()
# plot top 30 2-grams
ggplot(freq_2gram_top30_ns) +
aes(x = reorder(feature, frequency), y = frequency) +
ggtitle("Figure 4: Top 30 2-Grams by Frequency (Stopwords Removed)") +
xlab("feature (2-gram)") +
geom_col(fill='darkred') +
coord_flip()
# plot top 30 3-grams
ggplot(freq_3gram_top30_ns) +
aes(x = reorder(feature, frequency), y = frequency) +
ggtitle("Figure 5: Top 30 3-Grams by Frequency (Stopwords Removed)") +
xlab("feature (3-gram)") +
geom_col(fill='darkred') +
coord_flip()
textplot_wordcloud(sample_dfm, max_words = 100, min_size=1)
# get top 30 1-, 2- and 3-grams by frequency (stopwords included)
freq_top30 <- freq_df %>% slice(1:30)
freq_2gram_top30 <- freq_df_2gram %>% slice(1:30)
freq_3gram_top30 <- freq_df_3gram %>% slice(1:30)
# plot top 30 1-grams
ggplot(freq_top30) +
aes(x = reorder(feature, frequency), y = frequency) +
ggtitle("Figure 7: Top 30 1-grams by Frequency (Stopwords Included)") +
xlab("feature (1-Gram)") +
geom_col(fill='darkred') +
coord_flip()
# plot top 30 2-grams
ggplot(freq_2gram_top30) +
aes(x = reorder(feature, frequency), y = frequency) +
ggtitle("Figure 8: Top 30 2-Grams by Frequency (Stopwords Included)") +
xlab("Feature (2-gram)") +
geom_col(fill='darkred') +
coord_flip()
# plot top 30 3-grams
ggplot(freq_3gram_top30) +
aes(x = reorder(feature, frequency), y = frequency) +
ggtitle("Figure 9: Top 30 3-Grams by Frequency (Stopwords Included)") +
xlab("feature (3-gram)") +
geom_col(fill='darkred') +
coord_flip()
# calculate minimum no. of unique features required to meet required corpus coverage level
features_needed <- function(required_coverage, freqs, total)
{
current_feature <- 0
current_tokens <- 0
# while tokens counted is less than required amount
while (current_tokens < total * required_coverage) {
# increment feature counter/index
current_feature <- current_feature + 1
# increment count of tokens covered based on frequency of current feature
current_tokens <- current_tokens + freqs[current_feature,]$frequency
}
# return no. of features needed
return(current_feature)
}
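# note (alternative sketch): the same count could be computed without the loop
# using a cumulative sum, which would speed up the full 0-1 sweep below;
# features_needed_vec is a suggested name, not part of the original analysis
features_needed_vec <- function(required_coverage, freqs, total) {
    # count how many cumulative totals (including the empty prefix, 0) fall
    # short of the required coverage; this equals the number of features needed
    sum(c(0, cumsum(freqs$frequency)) < total * required_coverage)
}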
# generate vector of coverage proportions in 0.01 increments
coverage <- seq(0, 1, 0.01)
# calculate required features
features_proportion <- sapply(coverage, function(n) { features_needed(n, freq_df_ns, total_tokens_ns) }) / total_features_ns
# generate data frame from these
coverage_table <- data.frame(coverage, features_proportion)
# display table for 50, 90, 98, 99%
kable(coverage_table[c(51, 91, 99, 100),],
col.names = c("Proportion of Corpus Covered", "Proportion of Unique Features Required"),
caption = "Table 3: Proportions of Unique Featured Required for Certain Coverage Levels of the Corpus (Stopwords Removed)",
row.names = FALSE)
# plot features needed against coverage level
ggplot(coverage_table) +
aes(x = coverage, y = features_proportion) +
geom_line() +
xlab("Proportion of Corpus Covered") +
ylab("Proportion of Unique Features Required") +
ggtitle("Figure 10: Graph of Proportion of Unique Features Required\nAgainst Corpus Coverage Level (Stopwords Removed)")
# same as previously, but with stopwords included
coverage <- seq(0, 1, 0.01)
features_proportion <- sapply(coverage, function(n) { features_needed(n, freq_df, total_tokens) }) / total_features
coverage_table <- data.frame(coverage, features_proportion)
kable(coverage_table[c(51, 91, 99, 100),],
col.names = c("Proportion of Corpus Covered", "Proportion of Unique Features Required"),
caption = "Table 4: Proportions of Unique Featured Required for Certain Coverage Levels of the Corpus (Stopwords Included)",
row.names = FALSE)
ggplot(coverage_table) +
aes(x = coverage, y = features_proportion) +
geom_line() +
xlab("Proportion of Corpus Covered") +
ylab("Proportion of Unique Features Required") +
ggtitle("Figure 11: Graph of Proportion of Unique Features Required\nAgainst Corpus Coverage Level (Stopwords Included)")