Capstone: Milestone Report

Elijah Appiah
January 08, 2022

Introduction

This project analyzes the HC Corpora dataset with the end goal of creating a Shiny app that uses n-grams to predict the next word in a phrase. This first milestone report summarizes an exploratory data analysis.

File Summary

Three data files sourced from blogs, news, and twitter were read into R. The news file contained hidden null characters that prevented a full read; these had to be deleted by hand in Notepad++ before the file could be loaded.
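
For reference, the loading step might look like the sketch below; the file paths are assumptions, and readLines() with skipNul = TRUE would be an alternative to deleting the null characters by hand.

# Sketch of the file-loading step (paths are assumptions)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)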

File      Size (MB)   Lines       Characters     Words         Char. share   Line share   Word share
blogs     200.4242    899,288     208,361,438    37,334,131    0.36          0.21         0.37
news      196.2775    1,010,242   203,791,400    34,372,528    0.35          0.24         0.34
twitter   159.3641    2,360,148   162,385,035    30,373,583    0.28          0.55         0.30

Processing files of this size pushed up against R's memory limits and ran slowly. To make the analysis tractable, we sampled ten percent of the lines from each file, cleaned the sample, and created n-grams. To speed processing further, we kept only the n-grams that together covered 90% of the sample phrases. A fully reproducible version of this analysis is available on GitHub.
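
For illustration, the sampling step might look like the sketch below, assuming the blogs, news, and twitter vectors from the loading step; the seed and object names are placeholders.

# Sketch: keep roughly ten percent of the lines from each source
set.seed(2022)
sample_text <- c(
  sample(blogs,   floor(0.10 * length(blogs))),
  sample(news,    floor(0.10 * length(news))),
  sample(twitter, floor(0.10 * length(twitter)))
)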

Uni-grams

The corpora are populated with many acronyms and abbreviations, such as "rt" for re-tweet, "lol" for laugh out loud, and "ic" for I see. Notably, we chose to leave the shorthand "im" (I am) and "dont" (don't / do not) as is, so they show up as uni-grams.
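
A minimal sketch of the uni-gram counting, assuming the dplyr and tidytext packages and the sample_text vector from the sampling step (object and column names are illustrative):

library(dplyr)
library(tidytext)

unigrams <- tibble(text = sample_text) %>%
  unnest_tokens(word, text) %>%      # one lowercase word token per row
  count(word, sort = TRUE) %>%       # uni-gram frequencies
  mutate(proportion = n / sum(n))    # relative frequencies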

Uni-gram Wordcloud

Word distribution can be summarized with a word cloud, where word size and color represent frequency. The words "im" and "time" show up as most frequent, followed by "people", "dont", "day", and "love". This is a popular visualization, but we prefer the relative-frequency column plots shown below.
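
A cloud like the one described here could be drawn with the wordcloud package; the sketch below uses the unigrams table from the previous section.

library(wordcloud)

# word size (and a color ramp) reflect uni-gram frequency
with(unigrams,
     wordcloud(words = word, freq = n, max.words = 100,
               colors = RColorBrewer::brewer.pal(8, "Dark2")))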

Uni-grams, By Source

The three sources - blogs, news, and twitter - show different relative word frequencies. Among the most frequent words, "rt" occurs only on twitter, "ic" and "donc" only in blogs, and "city", "percent", and "county" only in news.
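
The column plots might be produced along the lines of the sketch below, which assumes a unigrams_by_source table holding source, word, and proportion columns (the table and column names are assumptions).

library(dplyr)
library(ggplot2)

unigrams_by_source %>%
  group_by(source) %>%
  slice_max(proportion, n = 10) %>%    # ten most frequent words per source
  ggplot(aes(x = reorder(word, proportion), y = proportion)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source, scales = "free_y") +
  labs(x = NULL, y = "Relative frequency")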

Uni-gram Distribution

Distributions were created for each set of n-grams, based on relative frequency.

Bi-gram Distribution

Tri-gram Distribution

Quad-gram Distribution
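
The n-gram tables behind these distributions might be built along the lines of the sketch below, shown for the quad-gram case (bi- and tri-grams are analogous, with n = 2 and n = 3). It assumes the sample_text vector from earlier and is a sketch rather than the exact code used.

library(dplyr)
library(tidyr)
library(tidytext)

quadgrams <- tibble(text = sample_text) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%   # four-word phrases
  filter(!is.na(ngram)) %>%
  count(ngram, sort = TRUE) %>%
  separate(ngram, into = c("word1", "word2", "word3", "word4"), sep = " ") %>%
  mutate(proportion = n / sum(n),          # relative frequency
         coverage   = cumsum(proportion))  # cumulative share of quad-grams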

N-gram Prediction Model

I anticipate using the bi-gram, tri-gram, and quad-gram tables as the basis for prediction. When the user inputs a single word, the model finds the bi-gram with the greatest relative frequency that begins with that word. Similarly, the tri-gram table will be used for two-word entries, and so on.

word1   word2   word3   word4   n     proportion   coverage
the     end     of      the     806   8.93e-05     0.0000893
at      the     end     of      656   7.27e-05     0.0001619
the     rest    of      the     651   7.21e-05     0.0002340
for     the     first   time    613   6.79e-05     0.0003019
at      the     same    time    506   5.60e-05     0.0003580
is      going   to      be      482   5.34e-05     0.0004113

Notice in the quad-gram table that the 4-grams are separated by word and arranged by relative frequency. When the user inputs three words, the model matches those words and returns the fourth word with the greatest relative frequency. When there is no match, or when more than three words are entered, the model falls back to a random completion.
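
As a sketch of the planned lookup (the function and object names are assumptions, not the final Shiny code), the quad-gram step might look like:

library(dplyr)

predict_next_word <- function(w1, w2, w3, quadgrams, unigrams) {
  match <- quadgrams %>%
    filter(word1 == w1, word2 == w2, word3 == w3) %>%
    slice_max(proportion, n = 1, with_ties = FALSE)

  if (nrow(match) == 0) {
    # no quad-gram match: fall back to a random completion drawn
    # from the most frequent uni-grams
    sample(head(unigrams$word, 50), 1)
  } else {
    match$word4
  }
}

predict_next_word("at", "the", "end", quadgrams, unigrams)   # likely "of"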