Introduction

The goal of this project is to develop an algorithm that predicts the next word, and to build a word suggestion application using that algorithm. Source data in four languages (English, German, Russian and Finnish) was provided by the instructors. Since 1) the English data set is expected to be used for the exercises and 2) I have little knowledge of the other three languages, I use the English data set for this project.

The objective of this report is to 1) load the given data sets, 2) explore them and give a brief summary, and 3) come up with a plan to develop the algorithm and the application.

Data loading

Loading data sets

US_B <- readLines("final/en_US/en_US.blogs.txt")
US_N <- readLines("final/en_US/en_US.news.txt")
US_T <- readLines("final/en_US/en_US.twitter.txt")

Check the data sets

Check the size of the data sets

  • Blogs: 899288 lines and 2.6056432 × 10^8 bytes
  • News: 1010242 lines and 2.6175905 × 10^8 bytes
  • Twitter: 2360148 lines and 3.1603734 × 10^8 bytes
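The report does not say how the byte counts were obtained, so the following is only a minimal sketch of one way to produce a line and byte summary. The helper name summarise_lines and the choice of summing nchar(type = "bytes") are assumptions; object.size() or file.size() would give different numbers.

```r
# Hypothetical helper (not from the report): count lines and bytes
# in a character vector such as US_B, US_N or US_T.
summarise_lines <- function(lines) {
  c(lines = length(lines),
    bytes = sum(nchar(lines, type = "bytes")))
}

# Small stand-in vector so the sketch runs without the course files.
demo <- c("first line", "second, longer line", "third")
demo_summary <- summarise_lines(demo)
print(demo_summary)
```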

These data sets are too large to analyze in full within a reasonable time. I therefore draw a random sample of lines from each data set and continue the analysis on the samples. Because processing more than 1000 lines slows the calculations significantly, I select 1000 lines from each data set.
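The sampling step can be sketched as follows. The seed value and the helper name sample_lines are assumptions made for illustration; the report does not state which seed, if any, was used.

```r
set.seed(1234)  # arbitrary seed (an assumption) so the sample is reproducible

# Hypothetical helper: draw up to n lines at random, without replacement.
sample_lines <- function(lines, n = 1000) {
  lines[sample(length(lines), min(n, length(lines)))]
}

# Stand-in for US_B, US_N or US_T so the sketch runs on its own.
demo <- paste("line", 1:5000)
demo_sample <- sample_lines(demo, 1000)
length(demo_sample)
```

With the real data, the same call would be applied to each of US_B, US_N and US_T in turn.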

Check sampled sets

  • Blogs: 1000 lines and 2.85616 × 10^5 bytes
  • News: 1000 lines and 2.5596 × 10^5 bytes
  • Twitter: 1000 lines and 1.36672 × 10^5 bytes

Data cleaning

I use the tm package, an R package specialized for text mining.

Cleaning the data sets

The cleaning process includes the following steps:

  • Convert uppercase to lowercase
  • Remove punctuation
  • Remove numbers
  • Stemming (remove affixes)
  • Remove offensive words
  • Remove extra whitespace

In my opinion, whether a word or sentence is offensive depends heavily on context. Any word could be used in an offensive way, and some potentially offensive words could be used in non-offensive ways. Detecting offensive usage of words and eliminating it requires another level of language processing. Therefore, I stick with eliminating seven unambiguously offensive words (“shit”,“piss”,“fuck”,“cunt”,“cocksucker”,“motherfucker”,“tits”) from the data sets.
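The steps above could be chained with tm roughly as follows. This is a sketch under assumptions, not the exact code behind this report: the function name clean_corpus is hypothetical, the SnowballC package is assumed to be installed for stemming, and the word list is passed in as the banned argument. One ordering choice is mine: the sketch removes the listed words before stemming, because stemming may alter words (for example by stripping a suffix) so that whole-word matching no longer finds them.

```r
library(tm)         # text-mining framework used in this report
library(SnowballC)  # stemmer backend used by tm::stemDocument

# Hypothetical helper: apply the cleaning steps to a character vector.
clean_corpus <- function(lines, banned) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
  corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                 # drop digits
  corpus <- tm_map(corpus, removeWords, banned)           # drop listed words
  corpus <- tm_map(corpus, stemDocument)                  # stem remaining words
  corpus <- tm_map(corpus, stripWhitespace)               # collapse whitespace
  corpus
}

# "badword" is a harmless placeholder for the real word list.
demo <- c("Hello, World! 123", "This is   a BADWORD here.")
cleaned <- clean_corpus(demo, banned = "badword")
cleaned_text <- unlist(lapply(cleaned, content))
print(cleaned_text)
```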

After the cleaning process, the text lines look like the examples below.

  • Blog: the faith it take to give someon a second chanc to believ in someon dream to walk with someon through a struggl is direct tie to your faith in god
  • News: at one point today assembl speaker sheila oliv dessex said she did not think the chief bill in the packag would pass until may but after lastminut negoti includ an intervent by christi the assembl pass that bill after lawmak remov a provis to let fulltim worker with fewer than year on the job switch to a kstyle plan some lawmak worri that could hurt the stabil of the pension system
  • Twitter: will the bruin be play black and yellow for the next month until the start of next season

Data exploration

Frequently used words

Build a term-document matrix (TDM), which records the number of times each word in the corpus is found in each document, to find the words with the highest frequency of usage.
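A minimal sketch of this step with tm is shown here. The helper name top_terms and the toy input are assumptions; the report's own code may differ. Converting the TDM with as.matrix is acceptable for a few thousand short documents but can exhaust memory on the full data sets.

```r
library(tm)

# Hypothetical helper: total frequency of each term across all documents,
# sorted from most to least frequent.
top_terms <- function(lines, n = 20) {
  corpus <- VCorpus(VectorSource(lines))
  tdm <- TermDocumentMatrix(corpus)   # rows are terms, columns are documents
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(freq, n)
}

# Toy input standing in for the cleaned sample lines.
demo <- c("the cat sat on the mat", "the dog sat", "a cat and a dog")
demo_top <- top_terms(demo, 10)
print(demo_top)
```

By default, TermDocumentMatrix keeps only terms of at least three characters (the wordLengths control option). If the default was used for this report, that would be consistent with short words such as "to" and "of" not appearing in the single-word lists.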

Top 20 words with the highest frequency of usage:

  • Blog: the, and, that, for, with, you, was, this, have, but, not, are, from, all, her, she, they, when, had, his
  • News: the, and, for, that, with, said, was, have, but, his, are, from, has, year, not, they, this, who, out, will
  • Twitter: the, you, and, for, that, your, have, just, not, this, but, with, like, what, are, get, love, good, thank, know

Compare word frequencies among the data sources (Blog, News and Twitter) by plotting the 200 words with the highest frequency of usage. Note that “the” and “and” had much higher frequencies than the other words (the = Blog: 1973, News: 1911, Twitter: 392; and = Blog: 1124, News: 870, Twitter: 189); therefore, these two words were excluded from the graphs. The X and Y axes show the frequency of usage of words in each data source.

The word frequencies in the Blog and News sources are more similar to each other than either is to those in Twitter.
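One way to prepare such a pairwise comparison is sketched below. The helper name compare_freq and the toy counts are made up for illustration and are not the report's data; the inputs are assumed to be named frequency vectors, one per source.

```r
# Hypothetical helper: align two named frequency vectors on a common set
# of terms, drop the dominant words, and fill missing counts with zero.
compare_freq <- function(freq_a, freq_b, drop = c("the", "and")) {
  terms <- setdiff(union(names(freq_a), names(freq_b)), drop)
  a <- unname(freq_a[terms])
  b <- unname(freq_b[terms])
  a[is.na(a)] <- 0   # term absent from source A
  b[is.na(b)] <- 0   # term absent from source B
  data.frame(term = terms, a = a, b = b, stringsAsFactors = FALSE)
}

# Toy counts, not taken from the report.
toy_blog    <- c(the = 50, and = 30, that = 12, with = 9, her = 4)
toy_twitter <- c(the = 20, and = 8, that = 5, love = 7, with = 3)

cmp <- compare_freq(toy_blog, toy_twitter)
plot(cmp$a, cmp$b,
     xlab = "Frequency in source A", ylab = "Frequency in source B")
```

Points near a straight line through the origin indicate words used at similar relative rates in both sources; points on an axis are words that appear in only one of them.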

Next, examine bigrams (pairs of adjacent words) instead of single words.
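To make the idea concrete, bigram counting can be sketched in base R as below. This is an illustration under assumptions rather than the code used for the report, which may instead rely on a tokenizer passed to tm; the function names bigrams and bigram_counts are hypothetical, and the input is assumed to be already cleaned, space-separated text.

```r
# Hypothetical helper: split one cleaned line into its adjacent word pairs.
bigrams <- function(line) {
  w <- strsplit(line, " +")[[1]]
  w <- w[nzchar(w)]                      # drop empty tokens
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))        # pair word i with word i + 1
}

# Hypothetical helper: count bigrams over all lines, most frequent first.
bigram_counts <- function(lines) {
  grams <- unlist(lapply(lines, bigrams))
  sort(table(grams), decreasing = TRUE)
}

# Toy input standing in for the cleaned sample lines.
demo <- c("i love you", "i love r", "you know")
demo_counts <- bigram_counts(demo)
print(demo_counts)
```

Bigrams are formed within a line only, so the last word of one line is never paired with the first word of the next.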

Top 20 bigrams with the highest frequency of usage:

  • Blog: of the, in the, to the, to be, on the, and i, and the, for the, i am, i have, in a, with the, that i, is a, it is, i was, it was, go to, at the, i had
  • News: in the, of the, on the, for the, to the, at the, and the, in a, with the, as a, by the, he was, to be, from the, is a, of a, want to, with a, for a, it was
  • Twitter: in the, for the, of the, to be, at the, go to, on the, thank for, to the, i have, i love, you know, do you, have a, i cant, i dont, thank you, follow me, for a, in a

How many bigrams were collected?

  • Blog: 27441
  • News: 24031
  • Twitter: 9298

Compare bigram counts among the data sources (Blog, News, and Twitter) by plotting the 200 bigrams with the highest frequencies.

Summary

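From the data exploration, I found that word frequencies in the Blog and News sources are similar to each other, while those in Twitter differ from both.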

Because of this uniqueness in Twitter, I plan to develop an algorithm to predict the next word for the Twitter service. Since Twitter limits the number of characters in a post, I expect abbreviations to be used more often in this service. I also expect emoticons and emojis to be common. I will spend the next several weeks taking these Twitter-specific features into account while developing the algorithm and the application.